SLOPE IS ADAPTIVE TO UNKNOWN SPARSITY AND
ASYMPTOTICALLY MINIMAX

By Weijie Su and Emmanuel Candès

Stanford University 

We consider high-dimensional sparse regression problems in which
we observe $y = X\beta + z$, where $X$ is an $n \times p$ design matrix and
$z$ is an $n$-dimensional vector of independent Gaussian errors, each
with variance $\sigma^2$. Our focus is on the recently introduced SLOPE
estimator [16], which regularizes the least-squares estimates with the
rank-dependent penalty $\sum_{1 \le i \le p} \lambda_i |\hat\beta|_{(i)}$, where
$|\hat\beta|_{(i)}$ is the $i$th largest magnitude of the fitted coefficients.
Under Gaussian designs, where the entries of $X$ are i.i.d. $\mathcal{N}(0, 1/n)$,
we show that SLOPE, with weights $\lambda_i$ just about equal to
$\sigma \cdot \Phi^{-1}(1 - iq/(2p))$ ($\Phi^{-1}(\alpha)$ is the $\alpha$th quantile of a
standard normal and $q$ is a fixed number in $(0,1)$), achieves a squared
error of estimation obeying

$$\sup_{\|\beta\|_0 \le k} \mathbb{P}\left( \|\hat\beta_{\mathrm{SLOPE}} - \beta\|^2 > (1+\epsilon)\, 2\sigma^2 k \log(p/k) \right) \longrightarrow 0$$

as the dimension $p$ increases to $\infty$, and where $\epsilon > 0$ is an
arbitrary small constant. This holds under a weak assumption on the
$\ell_0$-sparsity level, namely, $k/p \to 0$ and $(k \log p)/n \to 0$, and is sharp
in the sense that this is the best possible error any estimator can
achieve. A remarkable feature is that SLOPE does not require any
knowledge of the degree of sparsity, and yet automatically adapts to
yield optimal total squared errors over a wide range of $\ell_0$-sparsity
classes. We are not aware of any other estimator with this property.


1. Introduction. Twenty years ago, Benjamini and Hochberg proposed
the false discovery rate (FDR) as a new measure of type-I error for multiple
testing, along with a procedure for controlling the FDR in the case of
statistically independent tests [9]. In words, the FDR is the expected value
of the ratio between the number of false rejections and the total number of
rejections, with the convention that this ratio vanishes in case no rejection
is made. To describe the Benjamini-Hochberg procedure, henceforth referred
to as the BHq procedure, imagine we observe a $p$-dimensional vector
$y \sim \mathcal{N}(\beta, \sigma^2 I_p)$ of independent statistics $\{y_i\}$, and wish to test
which means $\beta_i$ are nonzero. Begin by ordering the observations as
$|y|_{(1)} \ge |y|_{(2)} \ge \cdots \ge |y|_{(p)}$ (that is, from the most to the least
significant) and compute a data-dependent threshold given by

$$t_{\mathrm{FDR}} = |y|_{(R)},$$

where $R$ is the last time $|y|_{(i)}/\sigma$ exceeds a critical curve $\lambda_i^{\mathrm{BH}}$;
formally,

$$R = \max\{i : |y|_{(i)}/\sigma \ge \lambda_i^{\mathrm{BH}}\} \quad \text{with} \quad \lambda_i^{\mathrm{BH}} = \Phi^{-1}(1 - iq/(2p)); \tag{1.1}$$

throughout, $0 < q < 1$ is a target FDR level and $\Phi$ is the cumulative
distribution function of a standard normal random variable. (The chance
that a null statistic $z \sim \mathcal{N}(0,1)$ exceeds $\lambda_i^{\mathrm{BH}}$ is
$\mathbb{P}(|z| > \lambda_i^{\mathrm{BH}}) = q \cdot i/p$.) Then BHq rejects all those hypotheses
with $|y_i| \ge t_{\mathrm{FDR}}$ and makes no rejection in the case where all the
observations fall below the critical curve, i.e. when the set
$\{i : |y|_{(i)}/\sigma \ge \lambda_i^{\mathrm{BH}}\}$ is empty. In short, the hypotheses corresponding
to the $R$ most significant statistics are rejected. Letting $V$ be the number of
false rejections, Benjamini and Hochberg proved that this procedure controls
the FDR in the sense that


$$\mathrm{FDR} = \mathbb{E}\left[\frac{V}{R \vee 1}\right] = \frac{q\,p_0}{p} \le q,$$


where $p_0 = |\{i : \beta_i = 0\}|$ is the total number of nulls. Unlike a Bonferroni
procedure (see e.g. [15]), where the threshold for significance is fixed in
advance, a very appealing feature of the BHq procedure is that the threshold
is adaptive as it depends upon the data $y$. Roughly speaking, this threshold
is high when there are few discoveries to be made and low when there are
many.
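
The procedure just described can be sketched in a few lines of Python. This
is only an illustrative sketch, not code from the paper: the function names
(`bh_critical_values`, `bhq_threshold`, `bhq_reject`) are hypothetical, the
noise level $\sigma$ is assumed known, and the standard normal quantile comes
from the Python standard library.

```python
from statistics import NormalDist


def bh_critical_values(p, q):
    """Critical curve lambda_i^BH = Phi^{-1}(1 - i*q/(2p)) for i = 1..p."""
    std_normal = NormalDist()
    # Symmetry: Phi^{-1}(1 - c) = -Phi^{-1}(c); avoids rounding 1 - c to 1.
    return [-std_normal.inv_cdf(i * q / (2 * p)) for i in range(1, p + 1)]


def bhq_threshold(y, sigma, q):
    """Return (R, t_FDR) as in (1.1).

    R is the last index i with |y|_(i) / sigma >= lambda_i^BH. When no
    observation reaches the critical curve, R = 0 and t_FDR = +inf, so that
    nothing is rejected.
    """
    lam = bh_critical_values(len(y), q)
    sorted_abs = sorted((abs(v) for v in y), reverse=True)
    R = 0
    for i, (a, l) in enumerate(zip(sorted_abs, lam), start=1):
        if a / sigma >= l:
            R = i
    t_fdr = sorted_abs[R - 1] if R > 0 else float("inf")
    return R, t_fdr


def bhq_reject(y, sigma, q):
    """Indices of the hypotheses rejected by BHq (those with |y_i| >= t_FDR)."""
    _, t_fdr = bhq_threshold(y, sigma, q)
    return [i for i, v in enumerate(y) if abs(v) >= t_fdr]
```

Note that the loop keeps the last crossing of the critical curve rather than
the first one, which is what makes this a step-up rule.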

Interestingly, the acceptance of the FDR as a valid error measure has 
been slow coming, and we have learned that the FDR criterion initially 
met much resistance. Among other things, researchers questioned whether 
the FDR is the right quantity to control as opposed to more traditional 
measures such as the familywise error rate (FWER), and even if it were, they 
asked whether among all FDR controlling procedures, the BHq procedure is 
powerful enough. Today, we do not need to argue that this step-up procedure 
is a useful tool for addressing multiple comparison problems, as both the 
FDR concept and this method have gained enormous popularity in certain 
fields of science; for instance, they have influenced the practice of genomic 
research in a very concrete fashion. The point we wish to make is, however,
different: as we discuss next, if we look at the multiple testing problem from
a different point of view, namely, from that of estimation, then FDR becomes
in some sense the right notion to control, and naturally appears as a valid
error measure.

Consider estimating $\beta$ from the same data $y \sim \mathcal{N}(\beta, \sigma^2 I_p)$ and suppose
we have reasons to believe that the vector of means is sparse in the sense that
most of the coordinates of $\beta$ may be zero or close to zero, but have otherwise
no idea about the number of 'significant' means. It is well known that under
sparsity constraints, thresholding rules can far outperform the maximum
likelihood estimate (MLE). A key issue is thus how one should determine
an appropriate threshold. Inspired by the adaptivity of BHq, Abramovich
and Benjamini [1] suggested estimating the mean sequence by the following
testimation procedure:$^1$ use BHq to select which coordinates are worth
estimating via the MLE and which do not and can be set to zero. Formally, set
$0 < q < 1$ and define the FDR estimate as


$$\hat\beta_i = \begin{cases} y_i, & |y_i| \ge t_{\mathrm{FDR}}, \\ 0, & \text{otherwise.} \end{cases} \tag{1.2}$$


The idea behind the FDR-thresholding procedure is to automatically adapt
to the unknown sparsity level of the sequence of means under study. Now
a remarkably insightful article [2] published ten years ago rigorously
established that this way of thinking is fundamentally correct in the following
sense: if one chooses a constant $q \in (0, 1/2]$, then the FDR estimate is
asymptotically minimax over the class of $k$-sparse signals as long as $k$ is
neither too small nor too large. More precisely, take any $\beta \in \mathbb{R}^p$ with a
number $k$ of nonzero coordinates obeying $\log^5 p \le k \le p^{1-\delta}$ for any
constant $\delta > 0$. Then as $p \to \infty$, it holds that

$$\mathrm{MSE} = \mathbb{E}\|\hat\beta - \beta\|^2 \le (1 + o(1))\, 2\sigma^2 k \log(p/k). \tag{1.3}$$


It can be shown that the right-hand side is the asymptotic minimax risk over
the class of $k$-sparse signals ([2] provides other asymptotic minimax results
for $\ell_p$ balls) and, therefore, there is a sense in which the FDR estimate
asymptotically achieves the best possible mean-square error (MSE). This is
remarkable because the FDR estimate is not given any information about
the sparsity level $k$ and, no matter this value in the stated range, the estimate
will be of high quality. To a certain extent, the FDR criterion strikes the
perfect balance between bias and variance. Pick a higher threshold or a more
conservative testing procedure and the bias will increase, resulting in a loss of
minimaxity. Pick a lower threshold or use a more liberal procedure and the
variance will increase, causing a similar outcome. Thus we see that the FDR
criterion provides a fundamentally correct answer to an estimation problem
with squared loss, which is admittedly far from being a pure multiple testing
problem.

$^1$See [4] for the use of this word.

For the sake of completeness, we emphasize that the FDR thresholding
estimate happens to be very close to penalized estimation procedures proposed
earlier in the literature, which seek to regularize the maximum likelihood by
adding a penalty term of the form

$$\underset{b}{\operatorname{argmin}} \ \|y - b\|_2^2 + \mathrm{Pen}(\|b\|_0), \tag{1.4}$$

where $\mathrm{Pen}(k) = 2k\log(p/k)$; see [38] and [14, 54] for related ideas. In fact,
[2] begins by considering the penalized MLE with

$$\mathrm{Pen}(k) = \sum_{i \le k} (\lambda_i^{\mathrm{BH}})^2 = (1 + o(1))\, 2k \log(p/k),$$

which is different from the FDR thresholding estimate, and is shown to enjoy
asymptotic minimaxity under the restrictions on the sparsity levels listed
above. In a second step, [2] argues that the FDR thresholding estimate is
sufficiently close to this penalized MLE so that the estimation properties
carry over.

1.1. SLOPE. Our aim in this paper is to extend the link between
estimation and testing by showing that a procedure originally aimed at
controlling the FDR in variable selection problems enjoys optimal estimation
properties. We work with a linear model, which is far more general than the
orthogonal sequence model discussed up until this point; here, we observe
an $n$-dimensional response vector obeying

$$y = X\beta + z, \tag{1.5}$$

where $X \in \mathbb{R}^{n \times p}$ is a design matrix, $\beta \in \mathbb{R}^p$ is a vector of regression
coefficients and $z \sim \mathcal{N}(0, \sigma^2 I_n)$ is an error term.

On the testing side, finding finite sample procedures that would test
the $p$ hypotheses $H_j : \beta_j = 0$ while controlling the FDR, or other
measures of type-I errors, remains a challenging topic. When $p \le n$ and the
design $X$ has full column rank, this is equivalent to testing a vector of
means under arbitrary correlations since the model is equivalent to
$\hat\beta^{\mathrm{LS}} \sim \mathcal{N}(\beta, \sigma^2 (X'X)^{-1})$ ($\hat\beta^{\mathrm{LS}}$ is the least-squares estimate).
Applying the BHq procedure to the least-squares estimate (1) is not known to
control the FDR (the positive regression dependency [10] does not hold here),
and (2) suffers from high variability in false discovery proportions due to
correlations [16]. Having said this, we are aware of recent significant
progress on this problem including the development of the knockoff filter [6],
which is a powerful FDR controlling method working when $p \le n$, and other
innovative ideas [34, 45, 46, 41] relying on assumptions which may not always
hold.

On the estimation side, there are many procedures available for fitting
sparse regression models and the most widely used is the Lasso [53]. When
the design is orthogonal, the Lasso simply applies the same soft-thresholding
rule to all the coordinates of the least-squares estimates. This is equivalent to
comparing all the p-values to a fixed threshold. In the spirit of the adaptive
BHq procedure, [16] proposed a new fitting strategy called SLOPE, a
shorthand for Sorted L-One Penalized Estimation: fix a nonincreasing sequence
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ not all vanishing; then SLOPE is the solution to

$$\text{minimize} \quad \tfrac12 \|y - Xb\|^2 + \lambda_1 |b|_{(1)} + \lambda_2 |b|_{(2)} + \cdots + \lambda_p |b|_{(p)}, \tag{1.6}$$

where $|b|_{(1)} \ge |b|_{(2)} \ge \cdots \ge |b|_{(p)}$ are the order statistics of
$|b_1|, |b_2|, \ldots, |b_p|$. The regularization is a sorted $\ell_1$ norm, which
penalizes coefficients whose estimate is larger more heavily than those whose
estimate is smaller. This reminds us of the fact that in multiple testing
procedures, larger values of the test statistics are compared with higher
thresholds. In particular, recall that BHq compares $|y|_{(i)}/\sigma$ with
$\lambda_i^{\mathrm{BH}} = \Phi^{-1}(1 - iq/2p)$, the $(1 - iq/2p)$th quantile of a standard
normal (for information, the sequence $\lambda^{\mathrm{BH}}$ shall play a crucial role in
the rest of this paper). SLOPE is a convex program and [16] demonstrates an
efficient solution algorithm (the computational cost of solving a SLOPE
problem is roughly the same as that of solving the Lasso).

To gain some insights about SLOPE, it is helpful to consider the
orthogonal case, which we can take to be the identity without loss of generality.
When $X = I_p$, the SLOPE estimate is the solution to

$$\mathrm{prox}_\lambda(y) = \underset{b}{\operatorname{argmin}} \ \tfrac12 \|y - b\|^2 + \lambda_1 |b|_{(1)} + \cdots + \lambda_p |b|_{(p)}; \tag{1.7}$$

in the literature on optimization, this solution is called the prox to the
sorted $\ell_1$ norm evaluated at $y$, hence the notation in the left-hand side. (In
the case of a general orthogonal design in which $X'X = I_p$, the SLOPE
solution is $\mathrm{prox}_\lambda(X'y)$.) Suppose the observations are nonnegative and
already ordered, i.e. $y_1 \ge y_2 \ge \cdots \ge y_p \ge 0$.$^2$ Then by
[16, Proposition 2.2] SLOPE can be recast as the solution to

$$\begin{array}{ll} \text{minimize} & \tfrac12 \|y - \lambda - b\|^2 = \tfrac12 \sum_{i=1}^p (y_i - \lambda_i - b_i)^2 \\ \text{subject to} & b_1 \ge b_2 \ge \cdots \ge b_p \ge 0 \end{array} \tag{1.8}$$

so that it is equivalent to solving an isotonic regression problem with data
$y - \lambda$. Hence, methods like the pool adjacent violators algorithm (PAVA)
[44, 7] are directly applicable. Further, two observations are in order: the
first is that the fitted values have the same signs and ranks as the original
observations; for any pair $(i,j)$, $y_i \ge y_j$ implies that $\hat\beta_i \ge \hat\beta_j$. The
second is that the fitted values are as close as possible to the shrunken
observations $y_i - \lambda_i$ under the ordering constraint. Hence, SLOPE is a sort
of soft-thresholding estimate in which the amount of thresholding is data
dependent and such that the original ordering is preserved.

$^2$For arbitrary data, the solution can be obtained as follows: let $P$ be a
permutation that sorts the magnitudes $|y|$ in a non-increasing fashion. Then
$\mathrm{prox}_\lambda(y) = \mathrm{sgn}(y) \odot P^{-1}\, \mathrm{prox}_\lambda(P|y|)$, where $\odot$ is
componentwise multiplication. In words, we can replace the observations by
their sorted magnitudes, solve the problem and, finally, undo the ordering
and restore the signs.
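
To make the isotonic-regression view concrete, here is a minimal Python
sketch of the prox of the sorted $\ell_1$ norm. It is not the implementation
from [16]: the function name `prox_sorted_l1` is hypothetical, and the sketch
assumes that `lam` is nonincreasing and nonnegative. It sorts the magnitudes,
applies a stack-based pool adjacent violators pass to the shrunken data
$|y|_{(i)} - \lambda_i$, clips at zero, and finally undoes the ordering and
restores the signs.

```python
import math


def prox_sorted_l1(y, lam):
    """Prox of the sorted-l1 norm at y (lam nonincreasing and nonnegative)."""
    p = len(y)
    # Sort magnitudes in nonincreasing order, remembering original positions.
    order = sorted(range(p), key=lambda i: -abs(y[i]))
    shrunken = [abs(y[i]) - l for i, l in zip(order, lam)]

    # Pool adjacent violators: maintain blocks (sum, count) whose averages are
    # strictly decreasing; merge a new block into its predecessor otherwise.
    blocks = []
    for d in shrunken:
        total, count = d, 1
        while blocks and blocks[-1][0] / blocks[-1][1] <= total / count:
            prev_total, prev_count = blocks.pop()
            total += prev_total
            count += prev_count
        blocks.append((total, count))

    # Clip block averages at zero to enforce the constraint b_p >= 0.
    fitted_sorted = []
    for total, count in blocks:
        fitted_sorted.extend([max(total / count, 0.0)] * count)

    # Undo the ordering and restore the signs.
    out = [0.0] * p
    for rank, i in enumerate(order):
        v = fitted_sorted[rank]
        out[i] = math.copysign(v, y[i]) if v > 0 else 0.0
    return out
```

When all the $\lambda_i$ are equal, the ordering constraint is never active
and the sketch reduces to ordinary soft-thresholding, i.e. the Lasso under an
orthogonal design.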

To emphasize the similarities with the BHq procedure, assume that we
work with $\lambda_i = \sigma \cdot \lambda_i^{\mathrm{BH}}$ and that we use SLOPE as a multiple testing
procedure rejecting $H_i : \beta_i = 0$ if and only if $\hat\beta_i \ne 0$. Then this
procedure rejects all the hypotheses the BHq step-down procedure would
reject, and accepts all those the step-up procedure would accept. Under
independence, i.e. $y \sim \mathcal{N}(\beta, \sigma^2 I_p)$, SLOPE controls the FDR [16], namely,
$\mathrm{FDR}(\mathrm{SLOPE}) \le q\,p_0/p$, where again $p_0$ is the number of nulls, i.e. of
vanishing means.

Figure 1 displays SLOPE estimates for two distinct data sets, with one set
containing many more strong signals than the other. We see that SLOPE
sets a lower threshold of significance when there is a larger number of strong 
signals. We can also see that SLOPE tends to shrink less as observations 
decrease in magnitude. In summary, SLOPE encourages sparsity just as the 
Lasso, but unlike the Lasso its degree of penalization is adaptive to the 
unknown sparsity level. 

1.2. Orthogonal designs. We now turn to estimation properties of SLOPE
and begin by considering orthogonal designs. Multiplying both sides of (1.5)
by $X'$ gives the statistically equivalent Gaussian sequence model,

$$y = \beta + z,$$

where $z \sim \mathcal{N}(0, \sigma^2 I_p)$. Estimating a sparse mean vector from Gaussian data
is a well-studied problem with a long line of contributions, see [11, 27, 37,
14, 22, 43] for example. Among other things, we have already mentioned
that the asymptotic risk over sparse signals is known: consider a sequence
of problems in which $p \to \infty$ and $k/p \to 0$, then

$$R_p(k) = \inf_{\hat\beta} \sup_{\|\beta\|_0 \le k} \mathbb{E}\|\hat\beta - \beta\|^2 = (1 + o(1))\, 2\sigma^2 k \log(p/k),$$

where the infimum is taken over all measurable estimators, see [28] and [43].
Furthermore, both soft and hard-thresholding at the level of
$\sigma\sqrt{2\log(p/k)}$ are asymptotically minimax. Such estimates require
knowledge of the sparsity level ahead of time, which is not realistic. Our
first result is that SLOPE also achieves asymptotic minimaxity without this
knowledge.

[Figure 1, two panels: (a) Weak signals; (b) Strong signals.]

Fig 1: Illustrative examples of original observations and SLOPE estimates
with the identity design. All observations below the threshold indicated by
the dotted line are set to zero; this threshold is data dependent.

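Theorem 1.1. Let $X$ be orthogonal and assume that $p \to \infty$ with
$k/p \to 0$. Fix $0 < q < 1$. Then SLOPE with
$\lambda_i = \sigma \cdot \Phi^{-1}(1 - iq/2p) = \sigma \cdot \lambda_i^{\mathrm{BH}}$ obeys

$$\sup_{\|\beta\|_0 \le k} \mathbb{E}\|\hat\beta_{\mathrm{SLOPE}} - \beta\|^2 = (1 + o(1))\, 2\sigma^2 k \log(p/k). \tag{1.9}$$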

Hence, no matter how we select the parameter $q$ controlling the FDR level
in the range $(0,1)$, we get asymptotic minimaxity (in practice we would
probably stick to values of $q$ in the range $[0.05, 0.30]$). There are notable
differences with the result from [2] we discussed earlier. First, recall that to
achieve minimaxity in that work, the nominal FDR level needs to obey
$q \le 1/2$ (the MSE is larger otherwise) and the sparsity level is required to
obey $\log^5 p \le k \le p^{1-\delta}$ for a constant $\delta > 0$, i.e. the signal cannot be
too sparse nor too dense. The lower bound on sparsity has been improved to
$\log^{4.5} p$ [59]. In contrast, there are no restrictions of this nature in
Theorem 1.1; this has to do with the fact that SLOPE is a continuous
procedure whereas FDR thresholding is highly discontinuous; small
perturbations in the data can cause the FDR thresholding estimates to jump.
This idea may also be found in the recent work [42] in which the authors prove
that some smooth-thresholding procedures uniformly achieve asymptotic
minimaxity under the same assumptions as in Theorem 1.1. They also establish
some optimality results for these thresholding rules at a fixed $\beta$. Second,
SLOPE effortlessly extends to linear models while it is not clear how one
would extend FDR thresholding ideas in a computationally tractable fashion.

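One can ask which vectors $\beta$ achieve the equality in (1.9), and it is not
very hard to see that equality holds if the $k$ nonzero entries of $\beta$ are very
large. Suppose for simplicity that $\beta_1 \gg \beta_2 \gg \cdots \gg \beta_k \gg 1$ and that
$\beta_{k+1} = \cdots = \beta_p = 0$. Spacing the nonzero coefficients sufficiently far
apart will ensure that $y_j - \sigma\lambda_j^{\mathrm{BH}}$, $1 \le j \le k$, is nonincreasing with
high probability so that the SLOPE estimate is obtained by rank-dependent
soft-thresholding:

$$\hat\beta_{\mathrm{SLOPE},j} = y_j - \sigma\lambda_j^{\mathrm{BH}}.$$

Informally, since the mean-square error is the sum of the squared bias and
variance, this gives

$$\mathbb{E}(\hat\beta_{\mathrm{SLOPE},j} - \beta_j)^2 \approx \sigma^2 \cdot \left((\lambda_j^{\mathrm{BH}})^2 + 1\right).$$

Since $\sum_{1 \le j \le k} (\lambda_j^{\mathrm{BH}})^2 = (1 + o(1))\, 2k\log(p/k)$,$^3$ summing this
approximation over the first $k$ coordinates gives

$$\sum_{1 \le j \le k} \mathbb{E}(\hat\beta_{\mathrm{SLOPE},j} - \beta_j)^2 \approx \sigma^2 \cdot \Bigl(k + \sum_{1 \le j \le k} (\lambda_j^{\mathrm{BH}})^2\Bigr) = (1 + o(1))\, 2\sigma^2 k \log(p/k),$$

where the last equality follows from the condition $k/p \to 0$. Theorem 1.1
states that in comparison, the $p - k$ vanishing means contribute a negligible
MSE.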

We pause here to observe that if one hopes SLOPE with weights $\lambda_i$ to be
minimax, then they will need to satisfy

$$\sum_{i=1}^k \lambda_i^2 = (1 + o(1))\, 2k\log(p/k)$$

for all $k$ in the stated range. Since
$\lambda_j^2 = \sum_{i=1}^{j} \lambda_i^2 - \sum_{i=1}^{j-1} \lambda_i^2$, we have that $\lambda_j^2$ is
roughly the derivative of $f(x) = 2x\log(p/x)$ at $x = j$, yielding
$\lambda_j^2 \approx f'(j) = 2\log p - 2\log j - 2$, or

$$\lambda_j \approx \sqrt{2\log(p/j)} \approx \Phi^{-1}(1 - jq/2p).$$

As a remark, all our results (e.g. Theorems 1.1 and 1.2) continue to hold
if we replace $\lambda_j^{\mathrm{BH}}(q)$ with $\sqrt{2\log(p/j)}$.
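
As a purely numerical illustration of this heuristic, the following Python
sketch computes the BHq weights, the surrogate weights $\sqrt{2\log(p/j)}$,
and the ratio between the sum of the first $k$ squared BHq weights and
$2k\log(p/k)$. The function names are hypothetical and the sketch makes no
claim about how close to one the ratio is at a given finite $p$, since the
$1 + o(1)$ factor is only an asymptotic statement.

```python
import math
from statistics import NormalDist


def bh_weights(p, q):
    """BHq critical values Phi^{-1}(1 - j*q/(2p)), j = 1..p."""
    std_normal = NormalDist()
    # Phi^{-1}(1 - c) = -Phi^{-1}(c) keeps precision when c is tiny.
    return [-std_normal.inv_cdf(j * q / (2 * p)) for j in range(1, p + 1)]


def log_weights(p):
    """Surrogate weights sqrt(2 log(p/j)), j = 1..p."""
    return [math.sqrt(2 * math.log(p / j)) for j in range(1, p + 1)]


def penalty_ratio(p, k, q):
    """(sum of the first k squared BHq weights) / (2 k log(p/k)), for k < p."""
    lam = bh_weights(p, q)[:k]
    return sum(l * l for l in lam) / (2 * k * math.log(p / k))
```

For instance, `penalty_ratio(10**4, 10, 0.1)` evaluates the ratio for
$p = 10^4$, $k = 10$ and $q = 0.1$.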

$^3$This relation follows from $\Phi^{-1}(1 - c) = (1 + o(1))\sqrt{2\log(1/c)}$ when
$c \searrow 0$ and applying Stirling's approximation.

We speculate that Theorem 1.1, and to some extent Theorem 1.2 below,
extend to other loss functions. For instance, from the proof of Theorem 1.1
we believe that for $r \ge 1$,

$$\sup_{\|\beta\|_0 \le k} \mathbb{E}\|\hat\beta_{\mathrm{SLOPE}} - \beta\|_r^r = (1 + o(1)) \cdot k \cdot \left(2\sigma^2 \log(p/k)\right)^{r/2}$$

holds. Furthermore, examining the proof of Theorem 1.1 reveals that for all
$k$ not necessarily obeying $k/p \to 0$ (e.g. $k = p/2$),

$$\frac{\sup_{\|\beta\|_0 \le k} \mathbb{E}\|\hat\beta_{\mathrm{SLOPE}} - \beta\|^2}{R_p(k)} \le C(q),$$

where $C(q)$ is a positive numerical constant that only depends on $q$.


1.3. Random designs. We are interested in getting results for sparse
regression that would be just as sharp and precise as those presented in the
orthogonal case. In order to achieve this, we assume a tractable model in
which $X$ is a Gaussian random design with $X_{ij}$ i.i.d. $\mathcal{N}(0, 1/n)$ so that the
columns of $X$ have just about unit norm. Random designs allow one to analyze
fine structures of the models of interest with tools from random matrix
theory and large deviation theory, and are very popular for analyzing regression
methods in the statistics literature. An incomplete list of works using
Gaussian designs would include [20, 5, 13, 21, 58, 8, 31]. On the one hand,
Gaussian designs are amenable to analysis while on the other, they capture
some of the features one would encounter in real applications.

To avoid any ambiguity, the theorem below considers a sequence of
problems indexed by $(k_j, n_j, p_j)$, where the number of variables $p_j \to \infty$,
$k_j/p_j \to 0$ and $(k_j \log p_j)/n_j \to 0$. From now on, we shall omit the subscript.


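Theorem 1.2. Fix $0 < q < 1$ and set $\lambda = \sigma(1 + \epsilon)\lambda^{\mathrm{BH}}(q)$ for some
arbitrary constant $0 < \epsilon < 1$. Suppose $k/p \to 0$ and $(k\log p)/n \to 0$. Then

$$\sup_{\|\beta\|_0 \le k} \mathbb{P}\left( \frac{\|\hat\beta_{\mathrm{SLOPE}} - \beta\|^2}{2\sigma^2 k\log(p/k)} > 1 + 3\epsilon \right) \longrightarrow 0. \tag{1.10}$$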


For information, it is known that under some regularity conditions on the
design [50, 57], the minimax risk is on the order of $O(\sigma^2 k\log(p/k))$, without
a tight matching in the lower and upper bounds. Against this, our main
result states that SLOPE, which does not use any information about the
sparsity level, achieves a squared loss bounded by $(1 + o(1))\, 2\sigma^2 k\log(p/k)$
with large probability. This is the best any procedure can do as we show
next.

Theorem 1.3. Under the assumptions of Theorem 1.2, for any $\epsilon > 0$,
we have

$$\inf_{\hat\beta} \sup_{\|\beta\|_0 \le k} \mathbb{P}\left( \frac{\|\hat\beta - \beta\|^2}{2\sigma^2 k\log(p/k)} > 1 - \epsilon \right) \longrightarrow 1.$$


Similar results dealing with arbitrary designs can be found in the
literature, compare Theorem 1 in [60]. However, the notable difference is that our
theorem captures the exact constants in addition to the rate.

Taken together, Theorems 1.2 and 1.3 demonstrate that in a probabilistic
sense $2\sigma^2 k\log(p/k)$ is the fundamental limit for the squared loss and that
SLOPE achieves it. It is also likely that our methods would yield
corresponding bounds for the expected squared loss but this would involve
technical issues having to do with the bounding of the loss on rare events.
This being said, Theorem 1.2 provides a more accurate description of the
squared error than a result in expectation since it asserts that the error is
at most about $2\sigma^2 k\log(p/k)$ with high probability. The proof of this fact
presents several novel elements not found in the literature.

The condition $(k\log p)/n \to 0$ is natural and cannot be fundamentally
sharpened. To start with, our results imply that SLOPE perfectly recovers $\beta$
in the limit of vanishing noise. In the high-dimensional setting where $p > n$,
this connects with the literature on compressed sensing, which shows that in
the noiseless case, $n \ge 2(1 + o(1))\,k\log(p/k)$ Gaussian samples are necessary
for perfect recovery by $\ell_1$ methods in the regime of interest [32, 33]. Our
condition is a bit more stringent but naturally so since we are dealing with
noisy data.

We hope that it is clear that results for orthogonal designs do not imply
results for Gaussian designs because of (1) correlations between the columns
of the design and (2) the high dimensionality. Under an orthogonal design,
when there is no noise, one can recover $\beta$ by just computing $X'y$. However,
as discussed above it is far less clear how one should do this in the
high-dimensional regime when $p \gg n$. As an aside, with noise it would be foolish
to find $\hat\beta$ via $\mathrm{prox}_\lambda(X'y)$; that is, by applying $X'$ and then pretending
that we are dealing with an orthogonal design. Such estimates turn out to have
unbounded risks.

We remark that a preprint [36] considers statistical properties of a
generalization of OSCAR [18] that coincides with SLOPE. The findings and results
are very different from those presented here; for instance, the selection of
optimal weights $\lambda_i$ is not discussed.

Finally, to see our main results under a slightly different light, suppose
we get a new sample $(x^*, y^*)$ independent from the 'training set' $(X, y)$,
obeying the linear model $y^* = \langle x^*, \beta\rangle + z^*$ with $x^* \sim \mathcal{N}(0, n^{-1} I_p)$ and
$z^* \sim \mathcal{N}(0, \sigma^2)$. Then for any estimate $\hat\beta$, the prediction
$\hat y = \langle x^*, \hat\beta\rangle$ obeys

$$\mathbb{E}(y^* - \hat y)^2 = n^{-1}\, \mathbb{E}\|\hat\beta - \beta\|^2 + \sigma^2,$$

so that, in some sense, SLOPE with BH weights actually yields the best
possible prediction.
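
For completeness, the prediction identity can be checked in a few lines. The
derivation below is a sketch under the following assumptions: $\xi$ denotes
the noise term in $y^*$, taken to have mean zero and variance $\sigma^2$, and
$(x^*, \xi)$ is independent of the training set and hence of $\hat\beta$.

```latex
\begin{align*}
\mathbb{E}\bigl[(y^* - \hat y)^2 \,\big|\, \hat\beta\bigr]
  &= \mathbb{E}\bigl[(\langle x^*, \beta - \hat\beta\rangle + \xi)^2 \,\big|\, \hat\beta\bigr] \\
  &= (\beta - \hat\beta)^\top\, \mathbb{E}\bigl[x^* x^{*\top}\bigr]\, (\beta - \hat\beta)
     + 2\,\mathbb{E}[\xi]\;\mathbb{E}\bigl[\langle x^*, \beta - \hat\beta\rangle \,\big|\, \hat\beta\bigr]
     + \mathbb{E}[\xi^2] \\
  &= n^{-1}\,\|\hat\beta - \beta\|^2 + \sigma^2 ,
\end{align*}
```

where the last line uses $\mathbb{E}[x^* x^{*\top}] = n^{-1} I_p$ and
$\mathbb{E}[\xi] = 0$. Taking the expectation over the training set gives the
displayed formula.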

1.4. Back to multiple testing. Although our emphasis is on estimation, 
we would nevertheless like to briefly return to the multiple testing viewpoint. 
In [16, 17], a series of experiments demonstrated empirical FDR control 
whenever (3 is sufficiently sparse. While this paper does not go as far as 
proving that SLOPE controls the FDR in our Gaussian setting, the ideas 
underlying the proof of Theorem 1.2 have some implications for FDR control. 
Our discussion in this section is less formal. 

Suppose we wish to keep the false discovery proportion (FDP)
$\mathrm{FDP} = V/(R \vee 1) \le q$. Since the number of true discoveries $R - V$ is at
most $k$, the false discovery number
$V = |\{i : \beta_i = 0 \text{ and } \hat\beta_{\mathrm{SLOPE},i} \ne 0\}|$ must obey

$$V \le \frac{q}{1-q}\, k. \tag{1.11}$$

Interestingly, an intermediate result of the proof of Theorem 1.2 implies
that (1.11) is satisfied with probability tending to one if $k$ is sufficiently
large and $q$ is replaced by $(1 + o(1))q$. This is shown in Lemma 4.4. Another
consequence of our analysis is that if the nonzero regression coefficients are
larger than $1.1\,\sigma\lambda_1^{\mathrm{BH}}(q)$ (technically, we can replace 1.1 with any fixed
number greater than one), then the true positive proportion (the ratio between
the number of true discoveries and $k$) approaches one in probability. In this
setup, we thus have FDR control in the sense that

$$\mathrm{FDR}_{\mathrm{SLOPE}} \le (1 + o(1))\,q.$$

Figure 2 demonstrates empirical FDR control at the target level $q = 0.1$.
Over 500 replicates, the averaged FDR is 0.09, and the averaged false
discovery number $V$ is 9.4, as compared with 11.1, the upper bound in (1.11).
We emphasize that [16, 17] provide strong evidence that the FDR is also
controlled for moderate signals.
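
The value 11.1 can be recovered by hand: requiring $V/(V + T) \le q$ with at
most $T \le k$ true discoveries forces $V \le qk/(1-q)$, and $q = 0.1$,
$k = 100$ gives $100/9 \approx 11.1$. A one-function Python sketch (the
function name is hypothetical) makes the arithmetic explicit.

```python
def max_false_discoveries(q, k):
    """Largest V compatible with V / (V + k) <= q, i.e. V <= q * k / (1 - q)."""
    if not 0 < q < 1:
        raise ValueError("q must lie strictly between 0 and 1")
    return q * k / (1 - q)


# Setting of Figure 2: nominal level q = 0.1 and k = 100 nonzero coefficients.
bound = max_false_discoveries(0.1, 100)
```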

Since our paper proves that SLOPE does not make a large number of
false discoveries, the support of $\hat\beta_{\mathrm{SLOPE}}$ is of small size, and thus we see
that $\|X(\hat\beta_{\mathrm{SLOPE}} - \beta)\|^2$ is very nearly equal to $\|\hat\beta_{\mathrm{SLOPE}} - \beta\|^2$ since
skinny Gaussian matrices are near isometries. Therefore, we can carry our
results over to the estimation of the mean vector $X\beta$.

[Figure 2, two panels: (a) Histogram of FDP; (b) Histogram of V.]

Fig 2: Gaussian design with $(n, p) = (8{,}000, 10{,}000)$ and $\sigma = 1$. There are
$k = 100$ nonzero coefficients with amplitudes $10\sqrt{2\log p}$. Here, the nominal
level is $q = 0.1$ and $\lambda = 1.1\,\lambda^{\mathrm{BH}}(0.1)$.


Corollary 1.4. Under the assumptions of Theorem 1.2, 



As before, there are matching lower bounds: for these, it suffices to restrict
attention to estimates of the form $\hat\mu = X\hat\beta$ since projecting any estimator
$\hat\mu$ onto the column space of $X$ never increases the loss.

Corollary 1.5. Assume $k/p \to 0$ and $p = O(n)$. Then



Again, SLOPE is optimal for estimating the mean response, and achieves 
an estimation error which is the same as that holding for the regression 
coefficients themselves. 

1.5. Organization and notations. In the rest of the paper, we briefly
explore possible alternatives to SLOPE in Section 2. Section 3 concerns
the estimation properties of SLOPE under orthogonal designs and proves
Theorem 1.1. We then turn to study SLOPE under Gaussian random designs
in Section 4, where both Theorem 1.2 and Corollary 1.4 are proved. Last,
we prove corresponding lower bounds in Section 5, including Theorem 1.3.
Corollary 1.5 and auxiliary results are proved in the Appendix.

Recall that $p, n, k$ are positive integers with $p \to \infty$, but not necessarily
so for $k$. We use $\bar S$ for the complement of $S$. For any vector $a$, define the
support of $a$ as $\mathrm{supp}(a) = \{i : a_i \ne 0\}$. A bold-faced $\lambda$ denotes a general
vector obeying $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$, with at least one strict inequality.
For any integer $0 < m < p$, $\lambda^{[m]} = (\lambda_1, \ldots, \lambda_m)$ and
$\lambda^{-[m]} = (\lambda_{m+1}, \ldots, \lambda_p)$. We write $\lambda_\epsilon$ (the superscript is omitted
to save space) for the $\epsilon$-inflated BHq critical values,

$$\lambda_{\epsilon,i} = (1 + \epsilon)\lambda_i^{\mathrm{BH}} = (1 + \epsilon)\,\Phi^{-1}(1 - iq/(2p)).$$

Last and for simplicity, $\hat\beta$ is the SLOPE estimate, unless specified otherwise.


2. Alternatives to SLOPE? It is natural to wonder whether there
are other estimators that can potentially match the theoretical
performance of SLOPE for sparse regression. Although getting an answer is
beyond the scope of this paper, we pause to consider a few alternatives.


2.1. Other $\ell_1$ penalized methods. The Lasso,

$$\underset{b}{\text{minimize}} \quad \tfrac12 \|y - Xb\|^2 + \lambda\|b\|_1,$$

serves as a building block for a lot of sparse estimation procedures. If $\lambda$ is
chosen non-adaptively, then a value equal to $(1 - c) \cdot \sigma\sqrt{2\log p}$ for
$0 < c < 1$ would cause a large number of false discoveries even under the
global null and, consequently, the risk when estimating sparse signals would
be high. This phenomenon can already be seen in the orthogonal case [37, 43].
This means that if we choose $\lambda$ in a non-adaptive fashion then we would need
to select $\lambda \ge \sigma\sqrt{2\log p}$. Under the assumptions of Theorem 1.2, setting
$\lambda = (1 + c) \cdot \sigma\sqrt{2\log p}$ for an arbitrary positive constant $c$ gives

$$\sup_{\|\beta\|_0 \le k} \mathbb{P}\left( \frac{\|\hat\beta_{\mathrm{Lasso}} - \beta\|^2}{2\sigma^2 k\log p} > 1 \right) \longrightarrow 1. \tag{2.1}$$


The proof is in Appendix A.1. Hence the risk inflation does not decrease as
the sparsity level $k$ increases, whereas it does for SLOPE. Note that when
$p = n$ and $k = p^{1-\delta}$,

$$\frac{2\sigma^2 k\log p}{2\sigma^2 k\log(p/k)} = \frac{1}{\delta}.$$

The reason why the Lasso is suboptimal is that the bias is too large (the
fitted coefficients are shrunk too much towards zero). All in all, by our earlier
considerations and by letting $\delta \to 0$ above, we conclude that no matter how
we pick $\lambda$ non-adaptively, the ratio

$$\frac{\text{max risk of Lasso}}{\text{max risk of SLOPE}} \longrightarrow \infty$$

in the worst case over $k$.

Figures 3a and 3b compare SLOPE with Lasso estimates for both strong 
and moderate signals. SLOPE is more accurate than the Lasso in both cases, 
and the comparative advantage increases as k gets larger. This is consistent 
with the reasoning that SLOPE has a lower bias when k gets larger. 

Of course, one might want to select $\lambda$ in a data-dependent manner,
perhaps by cross-validation (see next section), or by attempting to control a
type-I error such as the FDR. For instance, we could travel on the Lasso
path and stop 'at some point'. Some recent procedures such as [46] make
very strong assumptions about the order in which variables enter the path
and are likely not to yield sharp estimation bounds such as (1.10), provided
that they can be analyzed. Others such as [40] are likely to be far too
conservative. In a different direction, it would be interesting to compare SLOPE
with the Lasso in different settings, where perhaps both $k/p$ and $n/p$
converge to positive constants. While some tools have been developed for the
Lasso in this asymptotic regime [8], it is unclear how SLOPE would behave
and even what a good sequence of weights $\{\lambda_i\}$ might be in this case.

2.2. Data-driven procedures. While finding tuning parameters adaptively
is an entirely new issue, a data-driven procedure where the regularization
parameter of the Lasso is chosen in an adaptive fashion would presumably
boost performance. Cross-validation comes to mind whenever applicable,
which is not always the case as when $y \sim \mathcal{N}(\beta, \sigma^2 I_p)$. Cross-validation
techniques are also subject to variance effects and may tend to select
over-parameterized models. To make the selection of the tuning parameter as
easy and accurate as possible, we work in the orthogonal setting where we
have available a remarkable unbiased estimate of the risk.

SURE thresholding [29] for estimating a vector of means from
$y \sim \mathcal{N}(\beta, \sigma^2 I_p)$ is a cross-validation type procedure in the sense that
the thresholding parameter is selected to minimize Stein's unbiased estimate
of risk (SURE) [52]. For soft-thresholding at $\lambda$, SURE reads

$$\mathrm{SURE}(\lambda) = p\sigma^2 + \sum_{i=1}^p (y_i^2 \wedge \lambda^2) - 2\sigma^2\, \#\{i : |y_i| \le \lambda\}.$$

One then applies the soft-thresholding rule at the minimizer $\hat\lambda$ of $\mathrm{SURE}(\lambda)$.
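
The SURE rule is simple enough to sketch directly. The Python functions
below are illustrative only (the names `sure`, `sure_threshold` and
`soft_threshold` are hypothetical, and $\sigma$ is assumed known). The
search is restricted to the candidates $\{0\} \cup \{|y_i|\}$: between two
consecutive magnitudes the count term is constant and the sum is
nondecreasing in $\lambda$, so the minimum over each such interval sits at
its left endpoint. Practical variants sometimes also cap the threshold; that
refinement is not included here.

```python
def sure(y, lam, sigma):
    """SURE(lam) = p s^2 + sum_i min(y_i^2, lam^2) - 2 s^2 #{i : |y_i| <= lam}."""
    p = len(y)
    clipped = sum(min(v * v, lam * lam) for v in y)
    n_small = sum(1 for v in y if abs(v) <= lam)
    return p * sigma ** 2 + clipped - 2 * sigma ** 2 * n_small


def sure_threshold(y, sigma):
    """Minimizer of SURE over the candidate set {0} together with {|y_i|}."""
    candidates = [0.0] + sorted(abs(v) for v in y)
    return min(candidates, key=lambda lam: sure(y, lam, sigma))


def soft_threshold(y, lam):
    """Componentwise soft-thresholding of y at level lam."""
    out = []
    for v in y:
        shrunk = max(abs(v) - lam, 0.0)
        out.append(shrunk if v >= 0 else -shrunk if shrunk > 0 else 0.0)
    return out
```

A typical use is `soft_threshold(y, sure_threshold(y, sigma))`.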

Fig 3: Panels (a) and (b) compare SLOPE and the Lasso under Gaussian
design with $(n, p) = (500, 1000)$ and $\sigma = 1$. The risk $E\|\hat\beta - \beta\|^2$ is averaged
over 100 replicates. SLOPE uses $\lambda = \lambda^{\mathrm{BH}}(q)$ and the Lasso uses $\lambda = \lambda^{\mathrm{Bonf}}(q)$ with
level $q = 0.05$. In (a), the components have magnitude $10\lambda^{\mathrm{Bonf}}$; in (b), the
magnitudes are set to $0.8\lambda^{\mathrm{Bonf}}$. Panels (c), (d), (e), (f) compare SLOPE with SURE
under orthogonal design: (c) strong signals and (d) moderate signals (MSE
against sparsity); (e) strong signals with $k = 1$ and (f) moderate signals with
$k = 1$ (MSE). The empirical distribution of $\|\hat\beta - \beta\|^2$ is obtained
from 10,000 replicates. Strong signals have nonzero $\beta_i$ set to $100\sqrt{2\log p}$
while this value is $0.8\sqrt{2\log p}$ for moderate signals. In (c) and (d), the bars
represent 75% and 25% percentiles.

It has been observed [29, 22] that SURE thresholding loses performance in
cases of sparse signals $\beta$, an empirical phenomenon which can perhaps be
made theoretically precise. Indeed, our own work in progress aims to show
that for any fixed sparsity $k$, SURE thresholding obeys

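$$\frac{\sup_{\|\beta\|_0 \le k} E\|\hat\beta_{\mathrm{SURE}} - \beta\|^2}{\sup_{\|\beta\|_0 \le k} E\|\hat\beta_{\mathrm{SLOPE}} - \beta\|^2} \longrightarrow \frac{k+1}{k},$$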

where $k$ is allowed to take the value zero and $(k+1)/k = \infty$ in this case. In
particular, SURE has a risk that is infinitely larger than SLOPE under the
global null $\beta = 0$.

Figure 3 compares SLOPE with SURE in estimation error. In Figures
3c and 3d, we see that SURE thresholding exhibits a squared error that
is consistently larger in mean (risk) and variability. This difference is more
pronounced the sparser the signal. Figures 3e and 3f display the error
distribution for $k = 1$; we see that the error of SURE thresholding is distributed
over a longer range.

2.3. Variations on FDR thresholding. As brought up earlier, the paper
[42] suggests a variation on FDR thresholding, where an adaptive
smooth-thresholding rule is applied instead of a hard one. Such a procedure is still
intrinsically limited to sequence models, and cannot be generalized to linear
regression. On this subject, consider the sequential FDR thresholding rule,

$$\hat\beta_{\mathrm{seq},i} = \mathrm{sgn}(y_i) \cdot \left(|y_i| - \sigma\lambda^{\mathrm{BH}}_{r(i)}\right)_+,$$

where $r(i)$ is the rank of $y_i$ when sorting the observations by decreasing
order of magnitude; that is, we apply soft-thresholding at level $\sigma\lambda^{\mathrm{BH}}_i$ to the
$i$th largest observation (in magnitude). Under the same assumptions as in
Theorem 1.1, this estimator also obeys

(2.2)  $$\sup_{\|\beta\|_0 \le k} E\|\hat\beta_{\mathrm{seq}} - \beta\|^2 = (1 + o(1))\, 2\sigma^2 k \log(p/k).$$

The proof is in Appendix A.1 and resembles that of Theorem 1.1. Even
though the worst case performance of this estimate matches that of SLOPE,
it is not a desirable procedure for at least two reasons. The first is that it
is not monotone; we may have $|y_i| > |y_j|$ and $|\hat\beta_{\mathrm{seq},i}| < |\hat\beta_{\mathrm{seq},j}|$, which does not
make much sense. A consequence is that it will generally have higher risk.
Also note that this estimator is not continuous with respect to $y$, since a
small perturbation can change the ordering of magnitudes and, therefore,
the amount of shrinkage applied to an individual component. The second
reason is that this procedure does not really extend to linear models.
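The lack of monotonicity is easy to see numerically. The sketch below is illustrative only: it assumes the BH weights $\lambda^{\mathrm{BH}}_i = \Phi^{-1}(1 - iq/(2p))$ from the abstract, and the helper names `bh_weights` and `seq_fdr_threshold`, as well as the toy inputs, are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def bh_weights(p, q):
    """BH weights lambda_i = Phi^{-1}(1 - i*q/(2p)), i = 1..p (sigma = 1)."""
    nd = NormalDist()
    return np.array([nd.inv_cdf(1.0 - i * q / (2.0 * p)) for i in range(1, p + 1)])

def seq_fdr_threshold(y, lam, sigma=1.0):
    """Soft-threshold the i-th largest |y| at level sigma * lam_i."""
    y = np.asarray(y, dtype=float)
    order = np.argsort(-np.abs(y), kind="stable")  # order[0] indexes the largest |y_i|
    rank = np.empty(y.size, dtype=int)
    rank[order] = np.arange(y.size)                # zero-based rank r(i) - 1
    return np.sign(y) * np.maximum(np.abs(y) - sigma * lam[rank], 0.0)

# Toy example (hypothetical numbers): two observations of nearly equal size.
lam = bh_weights(p=2, q=0.5)
est = seq_fdr_threshold([2.0, 1.9], lam)
```

In this toy example the larger observation is shrunk by the larger weight and ends up with the smaller estimate, and swapping the two inputs swaps which coordinate is shrunk more, which illustrates the discontinuity in $y$.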
3. Orthogonal designs. This section proves the optimality of SLOPE
under orthogonal designs. As we shall see, the proof is considerably shorter
and simpler than that in [2] for FDR thresholding. One reason for this is
that SLOPE continuously depends on the observation vector while FDR
thresholding does not, a fact which causes serious technical difficulties. The
discontinuities of the FDR hard-thresholding procedure also limit the range
of its effectiveness (recall the limits on the range of sparsity levels, which state
that the signal cannot be too sparse or too dense) as false discoveries result
in large squared errors.

A reason for separating the proof in the orthogonal case is pedagogical
in that the argument is conceptually simple and, yet, some of the ideas and
tools will carry over to that of Theorem 1.2. From now on and throughout
the paper we set $\sigma = 1$.

3.1. Preliminaries. We collect some preliminary facts, which will prove
useful, and begin with a definition used to characterize the solution to
SLOPE.

Definition 3.1. A vector $a \in \mathbb{R}^p$ is said to majorize $b \in \mathbb{R}^p$ if for all
$i = 1, \ldots, p$,

$$|a|_{(1)} + \cdots + |a|_{(i)} \ge |b|_{(1)} + \cdots + |b|_{(i)}.$$

This differs from a more standard definition (e.g. see [47]), where the
last inequality with $i = p$ is replaced by an equality (and absolute values
are omitted). We see that if $a$ majorizes $b$ and $c$ majorizes $d$, then the
concatenated vector $(a, c)$ majorizes $(b, d)$. For convenience, we list below
some basic but nontrivial properties of majorization and of the prox to
the sorted $\ell_1$ norm as defined in (1.7). All the proofs are deferred to the
Appendix.

Fact 3.1. If $a$ majorizes $b$, then $\|a\| \ge \|b\|$.

Fact 3.2. If $\lambda$ majorizes $a$, then $\mathrm{prox}_\lambda(a) = 0$.

Fact 3.3. The difference $a - \mathrm{prox}_\lambda(a)$ is majorized by $\lambda$.

Fact 3.4. Let $T$ be a nonempty proper subset of $\{1, \ldots, p\}$, and recall
that $a_T$ is the restriction of $a$ to $T$ and $\lambda^{-[m]} = (\lambda_{m+1}, \ldots, \lambda_p)$. Then

$$\left\|[\mathrm{prox}_\lambda(a)]_T\right\| \le \left\|\mathrm{prox}_{\lambda^{-[|\bar T|]}}(a_T)\right\|.$$

Lemma 3.1. For any $a$, it holds that

$$\|\mathrm{prox}_\lambda(a)\| \le \|(|a| - \lambda)_+\|,$$

where $|a|$ is the vector of magnitudes $(|a_1|, \ldots, |a_p|)$.

Proof of Lemma 3.1. The firm nonexpansiveness (e.g. see p. 131 of
[49]) of the prox reads

$$\|\mathrm{prox}_\lambda(a) - \mathrm{prox}_\lambda(b)\|^2 \le (a - b)'\left(\mathrm{prox}_\lambda(a) - \mathrm{prox}_\lambda(b)\right)$$

for all $a, b$. Taking $b = \mathrm{sgn}(a) \odot \lambda$, where $\odot$ is componentwise multiplication,
and observing that $\mathrm{prox}_\lambda(b) = 0$ (Fact 3.2) give

$$\|\mathrm{prox}_\lambda(a)\|^2 \le \langle \mathrm{sgn}(a) \odot (|a| - \lambda),\ \mathrm{prox}_\lambda(a) \rangle$$
$$\le \langle (|a| - \lambda)_+,\ \mathrm{sgn}(a) \odot \mathrm{prox}_\lambda(a) \rangle$$
$$\le \|(|a| - \lambda)_+\| \cdot \|\mathrm{prox}_\lambda(a)\|,$$

where we use the nonnegativity of $\mathrm{sgn}(a) \odot \mathrm{prox}_\lambda(a)$ and the
Cauchy-Schwarz inequality. This yields the lemma. □
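As a sanity check on Facts 3.2 and 3.3 and Lemma 3.1, the following sketch evaluates the prox numerically. It is an illustration under stated assumptions, not the paper's implementation: it assumes the prox in (1.7) is $\mathrm{prox}_\lambda(y) = \arg\min_b \tfrac12\|y-b\|^2 + \sum_i \lambda_i |b|_{(i)}$ with $\lambda_1 \ge \cdots \ge \lambda_p \ge 0$, and computes it by sorting, a pool-adjacent-violators pass and clipping at zero. The names `prox_sorted_l1` and `majorizes` are hypothetical.

```python
import numpy as np

def prox_sorted_l1(y, lam):
    """Prox of the sorted-l1 norm; lam must be nonincreasing and nonnegative."""
    y = np.asarray(y, dtype=float)
    lam = np.asarray(lam, dtype=float)
    order = np.argsort(-np.abs(y), kind="stable")   # indices by decreasing |y|
    v = np.abs(y)[order] - lam
    sums, counts = [], []
    for val in v:                                    # pool adjacent violators:
        sums.append(val)                             # enforce a nonincreasing fit
        counts.append(1)
        while len(sums) > 1 and sums[-2] / counts[-2] <= sums[-1] / counts[-1]:
            s, c = sums.pop(), counts.pop()
            sums[-1] += s
            counts[-1] += c
    fitted = np.concatenate([np.full(c, s / c) for s, c in zip(sums, counts)])
    out = np.empty_like(y)
    out[order] = np.maximum(fitted, 0.0)             # clip negatives to zero
    return np.sign(y) * out

def majorizes(a, b, tol=1e-9):
    """True if a majorizes b in the sense of Definition 3.1."""
    ca = np.cumsum(np.sort(np.abs(a))[::-1])
    cb = np.cumsum(np.sort(np.abs(b))[::-1])
    return bool(np.all(ca >= cb - tol))
```

With a constant weight vector this reduces to ordinary soft-thresholding, which gives a quick way to test the routine.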

3.2. Proof of Theorem 1.1. Let $S$ be the support of the vector $\beta$, $S =
\mathrm{supp}(\beta)$, and decompose the total mean-square error as

$$E\|\hat\beta - \beta\|^2 = E\|\hat\beta_S - \beta_S\|^2 + E\|\hat\beta_{\bar S} - \beta_{\bar S}\|^2,$$

i.e. as the sum of the contributions on and off support (in case $\|\beta\|_0 < k$,
augment $S$ to have size $k$). Theorem 1.1 follows from the following two
lemmas.

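Lemma 3.2. Under the assumptions of Theorem 1.1, for all $k$-sparse
vectors $\beta$,

$$E\|\hat\beta_S - \beta_S\|^2 \le (1 + o(1))\, 2k \log(p/k).$$

Proof. We know from Fact 3.3 that $y - \hat\beta$ is majorized by $\lambda = \lambda^{\mathrm{BH}}$,
which implies that $y_S - \hat\beta_S = \beta_S + z_S - \hat\beta_S$ is majorized by $\lambda^{[k]}$. The
triangle inequality together with Fact 3.1 give

$$\|\hat\beta_S - \beta_S\| \le \|\lambda^{[k]}\| + \|z_S\|.$$

This gives

$$E\|\hat\beta_S - \beta_S\|^2 \le \sum_{i=1}^{k} (\lambda^{\mathrm{BH}}_i)^2 + E\|z_S\|^2 + 2\sqrt{\sum_{1 \le i \le k} (\lambda^{\mathrm{BH}}_i)^2}\; E\|z_S\|$$
$$\le \sum_{i=1}^{k} (\lambda^{\mathrm{BH}}_i)^2 + E\|z_S\|^2 + 2\sqrt{\sum_{1 \le i \le k} (\lambda^{\mathrm{BH}}_i)^2 \; E\|z_S\|^2}$$
$$\le \sum_{i=1}^{k} (\lambda^{\mathrm{BH}}_i)^2 + k + 2\sqrt{k \sum_{1 \le i \le k} (\lambda^{\mathrm{BH}}_i)^2}$$
$$= (1 + o(1))\, 2k \log(p/k),$$

where the last step makes use of $\sum_{i=1}^{k} (\lambda^{\mathrm{BH}}_i)^2 = (1 + o(1))\, 2k \log(p/k)$ and
$\log(p/k) \to \infty$. □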
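The asymptotic identity $\sum_{i \le k} (\lambda^{\mathrm{BH}}_i)^2 = (1 + o(1))\, 2k\log(p/k)$ used in the last step can be explored numerically. The sketch below is illustrative only; it assumes $\lambda^{\mathrm{BH}}_i = \Phi^{-1}(1 - iq/(2p))$ with $\sigma = 1$, and the helper names and the values of $p$, $k$, $q$ are hypothetical choices.

```python
import math
from statistics import NormalDist

def bh_weights(p, q, k):
    """First k BH weights lambda_i = Phi^{-1}(1 - i*q/(2p)), with sigma = 1."""
    nd = NormalDist()
    # Phi^{-1}(1 - a) = -Phi^{-1}(a); this form avoids rounding 1 - a to 1 for tiny a.
    return [-nd.inv_cdf(i * q / (2.0 * p)) for i in range(1, k + 1)]

def weight_ratio(p, k, q=0.05):
    """Ratio of sum_{i<=k} lambda_i^2 to its asymptotic equivalent 2k log(p/k)."""
    lam = bh_weights(p, q, k)
    return sum(l * l for l in lam) / (2.0 * k * math.log(p / k))

ratio_small = weight_ratio(10**6, 1000)
ratio_large = weight_ratio(10**12, 1000)
```

In this sketch the ratio sits above one at finite $p$ and moves toward one only slowly as $p/k$ grows, which is consistent with the $(1 + o(1))$ factor being driven by $\log(p/k) \to \infty$.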

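Lemma 3.3. Under the assumptions of Theorem 1.1, for all $k$-sparse
vectors $\beta$,

(3.1)  $$E\|\hat\beta_{\bar S} - \beta_{\bar S}\|^2 = o(1)\, 2k \log(p/k).$$

Proof. It follows from Fact 3.4 that

$$\|\hat\beta_{\bar S}\|^2 = \left\|[\mathrm{prox}_\lambda(y)]_{\bar S}\right\|^2 \le \left\|\mathrm{prox}_{\lambda^{-[k]}}(z_{\bar S})\right\|^2.$$

We proceed by showing that for $\zeta \sim \mathcal{N}(0, I_{p-k})$, $E\|\mathrm{prox}_{\lambda^{-[k]}}(\zeta)\|^2 =
o(1)\, 2k \log(p/k)$. To do this, pick $A > 0$ sufficiently large such that
$q(1 + 1/A) < 1$ in Lemmas A.3 and A.4, which then give

$$\sum_{i=1}^{p-k} E\left(|\zeta|_{(i)} - \lambda^{\mathrm{BH}}_{k+i}\right)_+^2 = o(1)\, 2k \log(p/k).$$

The conclusion follows from Lemma 3.1 since

$$E\left\|\mathrm{prox}_{\lambda^{-[k]}}(\zeta)\right\|^2 \le \sum_{i=1}^{p-k} E\left(|\zeta|_{(i)} - \lambda^{\mathrm{BH}}_{k+i}\right)_+^2 = o(1)\, 2k \log(p/k).$$

□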

We conclude this section with a probabilistic bound on the squared loss.
The proposition below, whose argument is nearly identical to that of
Theorem 1.1, shall be used as a step in the proof of Theorem 1.2.

Proposition 3.4. Fix $0 < q < 1$ and set $\lambda = (1 + \epsilon)\lambda^{\mathrm{BH}}(q)$ for some
arbitrary $0 < \epsilon < 1$. Suppose $k/p \to 0$, then for each $\delta > 0$ and all $k$-sparse $\beta$,

$$P\left(\frac{\|\hat\beta - \beta\|^2}{(1 + \epsilon)^2\, 2k \log(p/k)} < 1 + \delta\right) \to 1.$$

Here, the convergence is uniform over $\epsilon$.

Proof. We only sketch the proof. As in the proof of Lemma 3.2, we have

$$\|\hat\beta_S - \beta_S\| \le \|\lambda^{[k]}\| + \|z_S\|.$$

Since $\|\lambda^{[k]}\| = (1 + o(1)) \cdot (1 + \epsilon)\sqrt{2k \log(p/k)}$ and $\|z_S\| = o_P\big(\sqrt{2k \log(p/k)}\big)$,
we have that for each $\delta > 0$,

$$P\left(\frac{\|\hat\beta_S - \beta_S\|^2}{(1 + \epsilon)^2\, 2k \log(p/k)} < 1 + \delta/2\right) \to 1.$$

Since $\lambda$ has increased, it is only natural that the off-support error remains
under control. In fact, (3.1) still holds, and the Markov inequality then gives

$$P\left(\frac{\|\hat\beta_{\bar S} - \beta_{\bar S}\|^2}{2k \log(p/k)} < \frac{\delta}{2}\right) \to 1.$$

This concludes the proof. □


4. Gaussian random designs. When moving from an orthogonal to
a non-orthogonal design, the correlations between the columns of $X$ and
the high dimensionality create much difficulty. This is already apparent
when scanning the literature on penalized sparse estimation procedures
such as the Lasso, SCAD [35], the Dantzig selector [25] and MC+ [61],
see e.g. [39, 24, 62, 23, 12, 55, 58, 48, 60, 8, 30] for a highly incomplete
list of references. For example, a statistical analysis of the Lasso often
relies on several ingredients: first, the Karush-Kuhn-Tucker (KKT) optimality
conditions; second, appropriate assumptions about the designs such as the
Gaussian model we use here, which guarantee a form of local orthogonality
(known under the name of restricted isometries or restricted eigenvalue
conditions); third, the selection of a penalty $\lambda$ several times the size of the
universal threshold $\sigma\sqrt{2 \log p}$, which, while introducing a large bias yielding
MSEs that cannot possibly approach the precise bounds we develop in this
paper, facilitates the analysis since it effectively sets many coordinates to
zero.

Our approach must be different for at least two reasons. To begin with,
the KKT conditions for SLOPE are not easy to manipulate. Leaving out
this technical matter, a more substantial difference is that the SLOPE
regularization is far weaker than that of a Lasso model with a large value of
the regularization parameter $\lambda$. To appreciate this distinction, consider the
orthogonal design setting. In such a simple situation, it is straightforward
to obtain error estimates about a hard thresholding rule set at, or at several
times, the Bonferroni level. Getting sharp estimates for FDR thresholding
is an entirely different matter; compare the far longer proof in [2].

4.1. Architecture of the proof. Our aim in this section is to provide a
general overview of the proof, explaining the key novel ideas and
intermediate results. At a high level, the general structure is fairly simple and is as
follows:

1. Exhibit an ideal estimator $\tilde\beta$, which is easy to analyze and achieves
the optimal squared error loss with high probability.

2. Prove that the SLOPE estimate $\hat\beta$ is close to this ideal estimate.

We discuss these in turn and recall that throughout, $\lambda = (1 + \epsilon)\lambda^{\mathrm{BH}}(q)$.

A solution algorithm for SLOPE is the proximal gradient method, which
operates as follows: starting from an initial guess $b^{(0)} \in \mathbb{R}^p$, inductively
define

$$b^{(m+1)} = \mathrm{prox}_{t_m \lambda}\left(b^{(m)} - t_m X'(X b^{(m)} - y)\right),$$

where $\{t_m\}$ is an appropriate sequence of step sizes. It is empirically
observed that under sparsity constraints, the proximal gradient algorithm for
SLOPE (and Lasso) converges quickly provided we start from a good initial
point. Here, we propose approximating the SLOPE solution by starting from
the ground truth and applying just one iteration; that is, with $t_0 = 1$, define

(4.1)  $$\tilde\beta := \mathrm{prox}_\lambda(\beta + X'z).$$

This oracle estimator $\tilde\beta$ approximates the SLOPE estimator $\hat\beta$ well (they
are equal when the design is orthogonal) and has statistical properties far
easier to understand. The lemma below is the subject of Section 4.2.
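The relation between the proximal gradient iteration and the one-step oracle (4.1) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' solver: it uses a constant step size $1/\|X\|^2$ (one admissible choice of $\{t_m\}$), a pool-adjacent-violators routine for the prox, and the hypothetical names `prox_sorted_l1`, `slope_objective`, `proximal_gradient` and `one_step_oracle`.

```python
import numpy as np

def prox_sorted_l1(y, lam):
    """Prox of the sorted-l1 norm (lam nonincreasing, nonnegative)."""
    y = np.asarray(y, dtype=float)
    order = np.argsort(-np.abs(y), kind="stable")
    v = np.abs(y)[order] - np.asarray(lam, dtype=float)
    sums, counts = [], []
    for val in v:                          # pool adjacent violators
        sums.append(val)
        counts.append(1)
        while len(sums) > 1 and sums[-2] / counts[-2] <= sums[-1] / counts[-1]:
            s, c = sums.pop(), counts.pop()
            sums[-1] += s
            counts[-1] += c
    fitted = np.concatenate([np.full(c, s / c) for s, c in zip(sums, counts)])
    out = np.empty_like(y)
    out[order] = np.maximum(fitted, 0.0)
    return np.sign(y) * out

def slope_objective(X, y, b, lam):
    """0.5 * ||y - X b||^2 + sum_i lam_i |b|_(i)."""
    return 0.5 * np.sum((y - X @ b) ** 2) + np.sum(lam * np.sort(np.abs(b))[::-1])

def proximal_gradient(X, y, lam, b0, steps=500):
    """b <- prox_{t*lam}(b - t X'(Xb - y)) with constant step t = 1/||X||^2."""
    t = 1.0 / np.linalg.norm(X, 2) ** 2
    b = np.array(b0, dtype=float)
    for _ in range(steps):
        b = prox_sorted_l1(b - t * (X.T @ (X @ b - y)), t * lam)
    return b

def one_step_oracle(X, y, lam, beta):
    """One iteration from the truth with t_0 = 1; equals prox_lam(beta + X'z) when y = X beta + z."""
    return prox_sorted_l1(beta - X.T @ (X @ beta - y), lam)
```

The oracle needs the unknown $\beta$, so it is an analysis device rather than an estimator one could compute in practice; when $X'X = I$ it coincides with the fixed point of the iteration.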

Lemma 4.1. Under the assumptions of Theorem 1.2, for all $k$-sparse $\beta$,
we have

$$P\left(\frac{\|\tilde\beta - \beta\|^2}{(1 + \epsilon)^2\, 2k \log(p/k)} < 1 + \delta\right) \to 1,$$

where $\delta > 0$ is an arbitrary constant.

Since we know that $\tilde\beta$ is asymptotically optimal, it suffices to show that
the squared distance between $\hat\beta$ and $\tilde\beta$ is negligible in comparison to that
between $\tilde\beta$ and $\beta$. This is captured by the result below, whose proof is the
subject of Section 4.3.

Lemma 4.2. Let $T \subset \{1, \ldots, p\}$ be a subset of columns assumed to
contain the supports of $\hat\beta$, $\tilde\beta$ and $\beta$; i.e. $T \supset \mathrm{supp}(\hat\beta) \cup \mathrm{supp}(\tilde\beta) \cup \mathrm{supp}(\beta)$.
Suppose all the eigenvalues of $X_T' X_T$ lie in $[1 - \delta, 1 + \delta]$ for some $\delta < 1/2$.
Then

$$\|\hat\beta - \tilde\beta\|^2 \le \frac{3\delta}{1 - 2\delta}\, \|\tilde\beta - \beta\|^2.$$

In particular, $\hat\beta = \tilde\beta$ under orthogonal designs.

We thus see that everything now comes down to showing that there is a
set of small cardinality containing the supports of $\hat\beta$, $\tilde\beta$ and $\beta$. While it is
easy to show that $\mathrm{supp}(\beta) \cup \mathrm{supp}(\tilde\beta)$ is of small cardinality, it is delicate to
show that this property still holds with the addition of the support of the
SLOPE estimate. Below, we introduce the resolvent set, which will prove to
contain $\mathrm{supp}(\beta) \cup \mathrm{supp}(\hat\beta) \cup \mathrm{supp}(\tilde\beta)$ with high probability.

Definition 4.1 (Resolvent set). Fix $S = \mathrm{supp}(\beta)$ of cardinality at most
$k$, and an integer $k^*$ obeying $k < k^* < p$. The set $S^* = S^*(S, k^*)$ is said
to be a resolvent set if it is the union of $S$ and the $k^* - k$ indices with the
largest values of $|X_i' z|$ among all $i \in \{1, \ldots, p\} \setminus S$.
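Definition 4.1 translates directly into a few lines of code. The sketch below is illustrative only; the function name `resolvent_set` and its argument names are hypothetical, and it assumes $S$ has exactly $k$ elements (augmented if necessary, as in Section 3.2).

```python
import numpy as np

def resolvent_set(X, z, support, k_star):
    """Union of `support` and the k_star - k off-support indices with largest |X_i' z|."""
    p = X.shape[1]
    support = np.asarray(sorted(support), dtype=int)
    k = support.size
    if not (k < k_star < p):
        raise ValueError("Definition 4.1 requires k < k_star < p")
    scores = np.abs(X.T @ z)                           # |X_i' z| for every column i
    off = np.setdiff1d(np.arange(p), support)          # indices outside S
    top = off[np.argsort(-scores[off], kind="stable")[: k_star - k]]
    return np.sort(np.concatenate([support, top]))
```

Like the oracle estimator, the resolvent set depends on the unobserved $S$ and $z$, so it serves as a device in the proof rather than as something computed from data.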

Under the assumptions of Theorem 1.2, we shall see in Section 4.4 that
we can choose $k^*$ in such a way that on the one hand $k^*$ is sufficiently small
compared to $p$ and $n/\log p$, and on the other, the resolvent set $S^*$ is still
expected to contain $\mathrm{supp}(\tilde\beta)$ (easier) and $\mathrm{supp}(\hat\beta)$ (more difficult). Formally,
Lemma 4.4 below shows that

(4.2)  $$\inf_{\|\beta\|_0 \le k} P\left(\mathrm{supp}(\beta) \cup \mathrm{supp}(\hat\beta) \cup \mathrm{supp}(\tilde\beta) \subset S^*\right) \to 1.$$

One can view the resolvent solution as a sophisticated type of a dual
certificate method, better known as primal-dual witness method [58, 23, 51] in the
statistics literature. A significant gradation in the difficulty of detecting the
support of the SLOPE solution a priori comes from the false discoveries we
commit because we happen to live on the edge, i.e. work with a procedure
as liberal as can be.

With (4.2) in place, Theorem 1.2 merely follows from Lemma 4.2 and the
accuracy of $\tilde\beta$ explained by Lemma 4.1; all the bookkeeping is in Section
4.5. Furthermore, Corollary 1.4 is just a stone's throw away; see
Section 4.5 for all the necessary details.


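4.2. One-step approximation. The proof of Lemma 4.1 is an immediate
consequence of Proposition 3.4. In brief, Borell's inequality (see Lemma
A.5) provides a well-known deviation bound about chi-square random
variables, namely,

$$P\left(\|z\| \le (1 + \epsilon)\sqrt{n}\right) \ge 1 - e^{-\epsilon^2 n/2} \to 1$$

since $\epsilon^2 n \to \infty$. Hence, to prove our claim, it suffices to establish that

(4.3)  $$P\left(\frac{\|\mathrm{prox}_{\lambda_\epsilon}(\beta + X'z) - \beta\|^2}{(1 + \epsilon)^2\, 2k \log(p/k)} < 1 + \delta \;\middle|\; \|z\| \le (1 + \epsilon)\sqrt{n}\right) \to 1,$$

where $\lambda_\epsilon = (1 + \epsilon)\lambda^{\mathrm{BH}}(q)$. Conditional on $\|z\| = c\sqrt{n}$ for some $0 < c \le 1 + \epsilon$,
$X'z \sim \mathcal{N}(0, c^2 I_p)$ and, therefore, conditionally,

$$\|\mathrm{prox}_{\lambda_\epsilon}(\beta + X'z) - \beta\| \overset{d}{=} \|\mathrm{prox}_{\lambda_\epsilon}(\beta + c\,\mathcal{N}(0, I_p)) - \beta\|$$
$$= c\,\|\mathrm{prox}_{\lambda_{\epsilon'}}(\beta/c + \mathcal{N}(0, I_p)) - \beta/c\|$$

for $\epsilon' = (1 + \epsilon)/c - 1 \ge 0$. Hence, Proposition 3.4 gives

$$P\left(\frac{\|\mathrm{prox}_{\lambda_{\epsilon'}}(\beta/c + \mathcal{N}(0, I_p)) - \beta/c\|^2}{(1 + \epsilon')^2\, 2k \log(p/k)} < 1 + \delta\right) \to 1.$$

Since $(1 + \epsilon)^2/c^2 = (1 + \epsilon')^2$, this is equivalent to

$$P\left(\frac{c^2\,\|\mathrm{prox}_{\lambda_{\epsilon'}}(\beta/c + \mathcal{N}(0, I_p)) - \beta/c\|^2}{(1 + \epsilon)^2\, 2k \log(p/k)} < 1 + \delta\right) \to 1.$$

This completes the proof since we can deduce (4.3) by averaging over $\|z\|$.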


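4.3. $\hat\beta$ and $\tilde\beta$ are close when $X$ is nearly orthogonal. We prove Lemma
4.2 in the case where $T = \{1, \ldots, p\}$, first. Set $J_\lambda(b) = \sum_{i=1}^{p} \lambda_i |b|_{(i)}$. By
definition, $\hat\beta$ and $\tilde\beta$ respectively minimize

$$L_1(b) := \tfrac12\|X(\beta - b)\|^2 + z'X(\beta - b) + J_\lambda(b),$$
$$L_2(b) := \tfrac12\|\beta - b\|^2 + z'X(\beta - b) + J_\lambda(b).$$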









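Next the assumptions about the eigenvalues of $X'X$ imply that these two
functions are related,

$$L_2(\hat\beta) - \frac{\delta}{2}\|\hat\beta - \beta\|^2 \le L_1(\hat\beta) \le L_2(\hat\beta) + \frac{\delta}{2}\|\hat\beta - \beta\|^2,$$
$$L_2(\tilde\beta) - \frac{\delta}{2}\|\tilde\beta - \beta\|^2 \le L_1(\tilde\beta) \le L_2(\tilde\beta) + \frac{\delta}{2}\|\tilde\beta - \beta\|^2.$$

Chaining these inequalities gives

(4.4)  $$L_2(\tilde\beta) + \frac{\delta\|\tilde\beta - \beta\|^2}{2} \ge L_1(\tilde\beta) \ge L_1(\hat\beta) \ge L_2(\hat\beta) - \frac{\delta\|\hat\beta - \beta\|^2}{2}.$$

Now the strong convexity of $L_2$ also gives

$$L_2(\hat\beta) \ge L_2(\tilde\beta) + \frac{\|\hat\beta - \tilde\beta\|^2}{2},$$

and plugging this in the right-hand side of (4.4) yields

(4.5)  $$\frac{\|\hat\beta - \tilde\beta\|^2}{2} \le \frac{\delta\|\tilde\beta - \beta\|^2}{2} + \frac{\delta\|\hat\beta - \beta\|^2}{2}.$$

Since $\delta\|\hat\beta - \beta\|^2/2 \le \delta\|\hat\beta - \tilde\beta\|^2 + \delta\|\tilde\beta - \beta\|^2$ (this is essentially the basic
inequality $(a + b)^2 \le 2a^2 + 2b^2$), the conclusion follows.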

We now consider the general case. Let $m$ be the cardinality of $T$ and for
$b \in \mathbb{R}^m$, set $J_{\lambda^{[m]}}(b) = \sum_{1 \le i \le m} \lambda_i |b|_{(i)}$. Observe that by assumption, $\hat\beta_T$
and $\tilde\beta_T$ are solutions to the reduced problems

(4.6)  $$\operatorname*{argmin}_{b \in \mathbb{R}^{|T|}}\ \tfrac12\|y - X_T b\|^2 + J_{\lambda^{[m]}}(b)$$

and

$$\operatorname*{argmin}_{b \in \mathbb{R}^{|T|}}\ \tfrac12\|\beta_T + X_T' z - b\|^2 + J_{\lambda^{[m]}}(b).$$

Using the fact that $X\beta = X_T\beta_T$, we see that $\hat\beta_T$ and $\tilde\beta_T$ respectively
minimize

$$L_1(b) := \tfrac12\|X_T(\beta_T - b)\|^2 + z'X_T(\beta_T - b) + J_{\lambda^{[m]}}(b),$$
$$L_2(b) := \tfrac12\|\beta_T - b\|^2 + z'X_T(\beta_T - b) + J_{\lambda^{[m]}}(b).$$

From now on, the proof is just as before.

4.4. Support localization. Below we write $a \preceq b$ as a short-hand for $b$
majorizes $a$ and

(4.7)  $$S_\diamond = \mathrm{supp}(\beta) \cup \mathrm{supp}(\hat\beta) \cup \mathrm{supp}(\tilde\beta).$$

Lemma 4.3 (Reduced SLOPE). Let $\hat b_T$ be the solution to the reduced
SLOPE problem (4.6), which only fits regression coefficients with indices in
$T$. If

(4.8)  $$X_{\bar T}'(y - X_T \hat b_T) \preceq \lambda^{-[|T|]},$$

then it is the solution to the full SLOPE problem in the sense that $\hat\beta$ defined
as $\hat\beta_T = \hat b_T$ and $\hat\beta_{\bar T} = 0$ is solution.

The inequality (4.8), which implies localization of the solution, reminds
us of a similar condition for the Lasso. In particular, if $\lambda_1 = \lambda_2 = \cdots = \lambda_p$,
then SLOPE is the Lasso and (4.8) is equivalent to $\|X_{\bar T}'(y - X_T \hat b_T)\|_\infty \le \lambda$.
In this case, it is well known that this implies that a solution to the Lasso
is supported on $T$, see e.g. [58, 23, 51].

The main result of this section is this: 

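Lemma 4.4. Suppose

$$k^* \ge \max\left\{\frac{1 + c}{1 - q}\, k,\ k + d\right\}$$

for an arbitrary small constant $c > 0$, where $d$ is a deterministic sequence
diverging to infinity in such a way that $k^*/p \to 0$ and $(k^* \log p)/n \to 0$.
Then

$$\inf_{\|\beta\|_0 \le k} P(S_\diamond \subset S^*) \to 1.$$

(Footnote to "diverging to infinity": Recall that we are considering a sequence of
problems with $(k_j, n_j, p_j)$ so that this is saying that $k_j^* \ge \max(2(1 - q)^{-1} k_j,\ k_j + d_j)$
with $d_j \to \infty$.)

Proof of Lemma 4.4. By construction, $\mathrm{supp}(\beta) \subset S^*$ so we only need
to show (i) $\mathrm{supp}(\hat\beta) \subset S^*$ and (ii) $\mathrm{supp}(\tilde\beta) \subset S^*$. We begin with (i). By
Lemma 4.3, $\mathrm{supp}(\hat\beta)$ is contained in $S^*$ if

$$X_{\bar S^*}'(y - X_{S^*}\hat b_{S^*}) \preceq \lambda_\epsilon^{-[k^*]},$$

where $\lambda_\epsilon = (1 + \epsilon)\lambda^{\mathrm{BH}}$, which would follow from

(4.9)  $$X_{\bar S^*}' X_{S^*}(\beta_{S^*} - \hat b_{S^*}) \preceq \frac{\epsilon}{2}\left(\lambda^{\mathrm{BH}}_{k^*+1}, \ldots, \lambda^{\mathrm{BH}}_p\right)$$

and

(4.10)  $$X_{\bar S^*}' z \preceq (1 + \epsilon/2)\left(\lambda^{\mathrm{BH}}_{k^*+1}, \ldots, \lambda^{\mathrm{BH}}_p\right).$$

Lemma A.12 in the Appendix concludes that (4.9) holds with probability
tending to one, since, by assumption, $\epsilon > 0$ is constant and $\sqrt{(k^* \log p)/n} \to 0$.
To show that (4.10) also holds with probability approaching one, we resort
to Lemma A.9. Conditional on $z$, $X_{\bar S}' z \sim \mathcal{N}(0, \|z\|^2/n \cdot I_{p-k})$. By definition,
$X_{\bar S^*}' z$ is formed from $X_{\bar S}' z$ by removing its $k^* - k$ largest entries in absolute
value. Denoting by $\zeta_1, \ldots, \zeta_{p-k}$ i.i.d. standard Gaussian random variables,
(4.10) thus boils down to

(4.11)  $$\left(|\zeta|_{(k^*-k+1)}, |\zeta|_{(k^*-k+2)}, \ldots, |\zeta|_{(p-k)}\right) \preceq \frac{(1 + \epsilon/2)\sqrt{n}}{\|z\|}\left(\lambda^{\mathrm{BH}}_{k^*+1}, \ldots, \lambda^{\mathrm{BH}}_p\right).$$

Borell's inequality (Lemma A.5) gives

$$P\left((1 + \epsilon/2)\sqrt{n}/\|z\| < 1\right) = P\left(\|z\| - \sqrt{n} > \epsilon\sqrt{n}/2\right) \le e^{-\epsilon^2 n/8} \to 0.$$

The conclusion follows from Lemma A.9.

We turn to (ii) and note that

$$(\beta + X'z)_{\bar S^*} = X_{\bar S^*}' z.$$

Now our previous analysis implies $X_{\bar S^*}' z \preceq \lambda_\epsilon^{-[k^*]}$ with probability tending
to one. However, it follows from Facts 3.4 and 3.2 that

$$\|\tilde\beta_{\bar S^*}\| = \left\|[\mathrm{prox}_{\lambda_\epsilon}(\beta + X'z)]_{\bar S^*}\right\| \le \left\|\mathrm{prox}_{\lambda_\epsilon^{-[k^*]}}(X_{\bar S^*}' z)\right\| = 0.$$

In summary, $X_{\bar S^*}' z \preceq \lambda_\epsilon^{-[k^*]}$ implies $\mathrm{supp}(\tilde\beta) \subset S^*$. This concludes the proof. □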


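4.5. Proof of Theorem 1.2 and Corollary 1.4. Put

$$\delta = \frac{1 + 3\epsilon}{(1 + \epsilon)^2} - 1 = \frac{\epsilon - \epsilon^2}{(1 + \epsilon)^2} > 0,$$

and choose any $\delta' > 0$ such that

$$(1 + \delta')\left(1 + \sqrt{3\delta'/(1 - 2\delta')}\right)^2 (1 + \delta/2) < 1 + \delta.$$

Let $\mathcal{A}_1$ be the event $S_\diamond \subset S^*$, $\mathcal{A}_2$ that all the singular values of $X_{S^*}$ lie in
$[\sqrt{1 - \delta'}, \sqrt{1 + \delta'}]$, and $\mathcal{A}_3$ that

$$\frac{\|\tilde\beta - \beta\|^2}{(1 + \epsilon)^2\, 2k \log(p/k)} < 1 + \frac{\delta}{2}.$$

We prove that each event happens with probability tending to one. For
$\mathcal{A}_1$, use Lemma 4.4, and set

$$d = \min\left\{\left\lfloor\sqrt{kn/\log p}\right\rfloor,\ \lfloor\sqrt{p}\rfloor\right\},$$

which diverges to $\infty$, and

$$k^* = \max\left\{\lceil 2k/(1 - q)\rceil,\ k + d\right\}.$$

It is easy to see that $k^*$ satisfies the assumptions of Lemma 4.4, which asserts
that $P(\mathcal{A}_1) \to 1$ uniformly over all $k$-sparse $\beta$. For $\mathcal{A}_2$, since $(k^* \log p)/n \to 0$
implies that $k^* \log(p/k^*)/n \to 0$, then taking $t$ sufficiently small in Lemma
A.11 gives $P(\mathcal{A}_2) \to 1$ uniformly over all $k$-sparse $\beta$. Finally, $P(\mathcal{A}_3) \to 1$ also
uniformly over all $k$-sparse $\beta$ by Lemma 4.1 since $\epsilon^2 n \to \infty$.

Hence, $P(\mathcal{A}_1 \cap \mathcal{A}_2 \cap \mathcal{A}_3) \to 1$ uniformly over all $\beta$ with sparsity at most
$k$. Consequently, it suffices to show that on this intersection,

$$\frac{\|\hat\beta - \beta\|^2}{2k \log(p/k)} < 1 + 3\epsilon, \qquad \frac{\|X\hat\beta - X\beta\|^2}{2k \log(p/k)} < 1 + 3\epsilon.$$

On $\mathcal{A}_1 \cap \mathcal{A}_2$, all the eigenvalues of $X_{S_\diamond}' X_{S_\diamond}$ are between $1 - \delta'$ and
$1 + \delta'$. By definition, all the coordinates of $\hat\beta$, $\tilde\beta$ and $\beta$ vanish outside of $S_\diamond$.
Thus, Lemma 4.2 gives

$$\|\hat\beta - \beta\| \le \|\hat\beta - \tilde\beta\| + \|\tilde\beta - \beta\| \le \left(\sqrt{\frac{3\delta'}{1 - 2\delta'}} + 1\right)\|\tilde\beta - \beta\| < \sqrt{\frac{1 + \delta}{(1 + \delta/2)(1 + \delta')}}\ \|\tilde\beta - \beta\|.$$

Hence, on $\mathcal{A}_1 \cap \mathcal{A}_2 \cap \mathcal{A}_3$, we have

$$\frac{\|\hat\beta - \beta\|^2}{2k \log(p/k)} < \frac{1 + \delta}{(1 + \delta/2)(1 + \delta')} \cdot \frac{\|\tilde\beta - \beta\|^2}{2k \log(p/k)} < \frac{(1 + \delta)(1 + \epsilon)^2}{1 + \delta'} < 1 + 3\epsilon,$$

and similarly,

$$\frac{\|X\hat\beta - X\beta\|^2}{2k \log(p/k)} \le (1 + \delta')\,\frac{\|\hat\beta - \beta\|^2}{2k \log(p/k)} < (1 + \delta') \cdot \frac{(1 + \delta)(1 + \epsilon)^2}{1 + \delta'} = 1 + 3\epsilon.$$

This finishes the proof.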

5. Lower bounds. We here prove Theorem 1.3, the lower matching
bound for Theorem 1.2, and leave the proof of Corollary 1.5 to Appendix
A.4. Once again, we warm up with the orthogonal design and develop tools
that can be readily applied to the regression case.

5.1. Orthogonal designs. Suppose $y \sim \mathcal{N}(\beta, I_p)$. The first result states
that in this model, the squared loss for estimating 1-sparse vectors cannot
be lower than $2 \log p$. The proof is in Appendix A.4.

Lemma 5.1. Let $\tau_p = (1 + o(1))\sqrt{2 \log p}$ be a sequence obeying $\sqrt{2 \log p} -
\tau_p \to \infty$. Consider the prior $\pi$ for $\beta$, which selects a coordinate $i$ uniformly
at random in $\{1, \ldots, p\}$, and sets $\beta_i = \tau_p$ and $\beta_j = 0$ for $j \ne i$. For each
$\epsilon > 0$,

$$\inf_{\hat\beta} P_\pi\left(\frac{\|\hat\beta - \beta\|^2}{2 \log p} > 1 - \epsilon\right) \to 1.$$

Next, we state a counterpart to Theorem 1.1, whose proof constructs $k$
independent 1-sparse recovery problems.

Proposition 5.2. Suppose $k/p \to 0$. Then for any $\epsilon > 0$, we have

$$\inf_{\hat\beta} \sup_{\|\beta\|_0 \le k} P\left(\frac{\|\hat\beta - \beta\|^2}{2k \log(p/k)} > 1 - \epsilon\right) \to 1.$$

Proof. The fundamental duality between 'min max' and 'max min' gives

$$\inf_{\hat\beta} \sup_{\|\beta\|_0 \le k} P\left(\frac{\|\hat\beta - \beta\|^2}{2k \log(p/k)} > 1 - \epsilon\right) \ge \sup_{\pi} \inf_{\hat\beta} P_\pi\left(\frac{\|\hat\beta - \beta\|^2}{2k \log(p/k)} > 1 - \epsilon\right).$$

Above, $\pi$ denotes any distribution on $\mathbb{R}^p$ such that any realization $\beta$ obeys
$\|\beta\|_0 \le k$, and $P_\pi(\cdot)$ emphasizes that $\beta$ follows the prior $\pi$, as earlier in
Lemma 5.1. It is therefore sufficient to construct a prior $\pi$ with a right-hand
side approaching one.

Assume $p$ is a multiple of $k$ (otherwise, replace $p$ with $p_0 = k\lfloor p/k \rfloor$ and
let $\pi$ be supported on $\{1, \ldots, p_0\}$). Partition $\{1, \ldots, p\}$ into $k$ consecutive
blocks $\{1, \ldots, p/k\}$, $\{p/k + 1, \ldots, 2p/k\}$ and so on. Our prior is a product
prior, where on each block, we select a coordinate uniformly at random and
set its amplitude to $\tau$, where $\tau = (1 + o(1))\sqrt{2 \log(p/k)}$ and $\sqrt{2 \log(p/k)} - \tau \to \infty$.
Next, let $\hat\beta$ be any estimator and write the loss $\|\hat\beta - \beta\|^2 = L_1 + \cdots + L_k$,
where $L_j$ is the contribution from the $j$th block. The lemma is reduced to
proving

(5.1)  $$\inf_{\hat\beta} P_\pi\left(\frac{L_1 + \cdots + L_k}{2k \log(p/k)} > 1 - \epsilon\right) \to 1.$$
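The product prior and the block decomposition of the loss can be sketched as follows. This is illustrative code, not part of the original argument; the names `sample_block_prior` and `block_losses` are hypothetical, and the amplitude `tau` is left to the caller (for instance $\sqrt{2\log(p/k)} - \log\sqrt{2\log(p/k)}$ would be one admissible choice under the stated conditions).

```python
import numpy as np

def sample_block_prior(p, k, tau, rng):
    """Draw beta: k consecutive blocks of length p/k, one uniformly chosen entry per block set to tau."""
    if p % k != 0:
        raise ValueError("this sketch assumes p is a multiple of k")
    block = p // k
    beta = np.zeros(p)
    picks = rng.integers(0, block, size=k)        # position within each block
    beta[np.arange(k) * block + picks] = tau
    return beta

def block_losses(beta_hat, beta, k):
    """Return (L_1, ..., L_k): squared error contributed by each consecutive block."""
    diff = np.asarray(beta_hat, dtype=float) - np.asarray(beta, dtype=float)
    return np.sum(diff.reshape(k, -1) ** 2, axis=1)
```

Every draw is exactly $k$-sparse, and the block losses sum to the total squared error, which is the decomposition $\|\hat\beta - \beta\|^2 = L_1 + \cdots + L_k$ used in the proof.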


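For any constant $\epsilon' > 0$, since $p/k \to \infty$, Lemma 5.1 claims that

(5.2)  $$\inf_{\hat\beta} P_\pi\left(\frac{L_j}{2 \log(p/k)} > 1 - \epsilon'\right) \to 1$$

uniformly over $j = 1, \ldots, k$ since distinct blocks are stochastically
independent. Set

$$\bar L_j = \min\{L_j,\ 2 \log(p/k)\} \le L_j.$$

On one hand,

$$\frac{E_\pi(\bar L_1 + \cdots + \bar L_k)}{2k \log(p/k)} \le (1 - \epsilon)\, P_\pi\left(\frac{\bar L_1 + \cdots + \bar L_k}{2k \log(p/k)} \le 1 - \epsilon\right) + P_\pi\left(\frac{\bar L_1 + \cdots + \bar L_k}{2k \log(p/k)} > 1 - \epsilon\right).$$

On the other,

$$\frac{E_\pi(\bar L_1 + \cdots + \bar L_k)}{2k \log(p/k)} \ge \frac{1 - \epsilon'}{k} \sum_{j=1}^{k} P_\pi\left(\frac{\bar L_j}{2 \log(p/k)} > 1 - \epsilon'\right).$$

All in all, this gives

$$\sup_{\hat\beta} P_\pi\left(\frac{\bar L_1 + \cdots + \bar L_k}{2k \log(p/k)} \le 1 - \epsilon\right) \le \frac{1}{\epsilon}\left[1 - (1 - \epsilon') \inf_{\hat\beta} P_\pi\left(\frac{\bar L_1}{2 \log(p/k)} > 1 - \epsilon'\right)\right].$$

Finally, take the limit $p \to \infty$ in the above inequality. Since $\bar L_j/(2 \log(p/k)) >
1 - \epsilon'$ if and only if $L_j/(2 \log(p/k)) > 1 - \epsilon'$, and since $\bar L_j \le L_j$, it follows from (5.2) that

$$\limsup_{p \to \infty}\ \sup_{\hat\beta} P_\pi\left(\frac{L_1 + \cdots + L_k}{2k \log(p/k)} \le 1 - \epsilon\right) \le \frac{\epsilon'}{\epsilon}.$$

We conclude by taking $\epsilon' \to 0$. □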


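5.2. Random designs. We return to the regression setup $y \sim \mathcal{N}(X\beta, I_n)$,
where $X$ is our Gaussian design.

Lemma 5.3. Fix $\alpha \le 1$ and

$$\tau_{p,n} = \left(\sqrt{2 \log p} - \log\sqrt{2 \log p}\right)\left(1 - 2\sqrt{(\log p)/n}\right).$$

Let $\pi$ be the prior from Lemma 5.1 with amplitude set to $\alpha \cdot \tau_{p,n}$. Assume
$(\log p)/n \to 0$. Then for any $\epsilon > 0$,

$$\inf_{\hat\beta} P_\pi\left(\frac{\|\hat\beta - \beta\|^2}{\alpha^2 \cdot 2 \log p} > 1 - \epsilon\right) \to 1.$$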

With this, we are ready to prove a stronger version of Theorem 1.3.

Theorem 5.4. [Stronger version of Theorem 1.3] Consider $y \sim \mathcal{N}(X\beta, \sigma^2 I_n)$,
where $X$ is our Gaussian design, $k/p \to 0$ and $\log(p/k)/n \to 0$. Then for
each $\epsilon > 0$,

$$\inf_{\hat\beta} \sup_{\|\beta\|_0 \le k} P\left(\frac{\|\hat\beta - \beta\|^2}{\sigma^2 \cdot 2k \log(p/k)} > 1 - \epsilon\right) \to 1.$$

Proof. The proof follows that of Proposition 5.2. As earlier, assume that
$\sigma = 1$ without loss of generality. The block prior $\pi$ and the decomposition
of the loss $L$ are exactly the same as before except that we work with

$$\tau = \left(\sqrt{2 \log(p/k)} - \log\sqrt{2 \log(p/k)}\right)\left(1 - 2\sqrt{\log(p/k)/n}\right).$$

Hence, it suffices to prove (5.2) in the current setting, which does not directly
follow from Lemma 5.3 because of correlations between the columns of $X$.
Thus, write the linear model as

$$y = X\beta + z = X^{(1)}\beta^{(1)} + X^{(-1)}\beta^{(-1)} + z,$$

where $X^{(1)}$ (resp. $\beta^{(1)}$) are the first $p/k$ columns of $X$ (resp. coordinates of
$\beta$) and $X^{(-1)}$ (resp. $\beta^{(-1)}$) all the others. Then

$$\tilde z := X^{(-1)}\beta^{(-1)} + z \sim \mathcal{N}\left(0,\ (\tau^2(k - 1)/n + 1)\, I_n\right),$$

and is independent of $X^{(1)}$ and $\beta^{(1)}$. Since $\tau^2(k - 1)/n + 1 \ge 1$ and $n/\log(p/k) \to
\infty$, we can apply Lemma 5.3 to

$$y = X^{(1)}\beta^{(1)} + \tilde z.$$

This establishes (5.2). □

6. Discussion. Regardless of the design, SLOPE is a concrete and
rapidly computable estimator, which also has intuitive statistical appeal.
For Gaussian designs, taking Benjamini-Hochberg weights achieves
asymptotic minimaxity over large sparsity classes. Furthermore, it is likely that
our novel methods would allow us to extend our optimality results to
designs with i.i.d. sub-Gaussian entries; for example, designs with independent
Bernoulli entries. Since SLOPE runs without any knowledge of the unknown
degree of sparsity, we hope that taken together, adaptivity and minimaxity
would confirm the appeal of this procedure.

It would of course be of great interest to extend our results to a broader
class of designs. In particular, we would like to know what types of results
are available when the variables are correlated. In such settings, is there a
good way to select the sequence of weights $\{\lambda_i\}$ when the rows of the design
are independently sampled from a multivariate Gaussian distribution with
zero mean and covariance $\Sigma$, say? How should we tune this sequence for
fixed designs? This paper does not address such important questions, and
we leave these open for future research.

Finally, returning to the issue of FDR control, it would be interesting
to establish rigorously whether or not SLOPE controls the FDR in sparse
settings.

APPENDIX A: PROOFS OF TECHNICAL RESULTS

As is standard, we write $a_n \asymp b_n$ for two positive sequences $a_n$ and $b_n$
if there exist two constants $C_1$ and $C_2$ (possibly depending on $q$) such that
$C_1 a_n \le b_n \le C_2 a_n$ for all $n$. Also, we write $a_n \sim b_n$ if $a_n/b_n \to 1$.


A.1. Proofs for Section 2. We remind the reader that the proofs in
this subsection rely on some lemmas to be stated later in the Appendix.

Proof of (2.1). For simplicity, denote by $\hat\beta$ the (full) Lasso solution
$\hat\beta_{\mathrm{Lasso}}$, and $\hat b_S$ the solution to the reduced Lasso problem

$$\operatorname*{minimize}_{b \in \mathbb{R}^k}\ \tfrac12\|y - X_S b\|^2 + \lambda\|b\|_1,$$

where $S$ is the support of the ground truth $\beta$. We show that (i)

(A.1)  $$\|X_{\bar S}' z\|_\infty \le (1 + c/2)\sqrt{2 \log p}$$

and (ii)

(A.2)  $$\|X_{\bar S}' X_S(\beta_S - \hat b_S)\|_\infty \le C\sqrt{(k \log^2 p)/n}$$

for some constant $C$, both happen with probability tending to one. Now
observe that $X_{\bar S}'(y - X_S \hat b_S) = X_{\bar S}' z + X_{\bar S}' X_S(\beta_S - \hat b_S)$. Hence, combining
(A.1) and (A.2) and using the fact that $(k \log p)/n \to 0$ give

$$\|X_{\bar S}'(y - X_S \hat b_S)\|_\infty \le \|X_{\bar S}' X_S(\beta_S - \hat b_S)\|_\infty + \|X_{\bar S}' z\|_\infty$$
$$\le C\sqrt{(k \log^2 p)/n} + (1 + c/2)\sqrt{2 \log p}$$
$$= o\big(\sqrt{2 \log p}\big) + (1 + c/2)\sqrt{2 \log p}$$
$$< (1 + c)\sqrt{2 \log p}$$

with probability approaching one. This last inequality together with the
fact that $\hat b_S$ obeys the KKT conditions for the reduced Lasso problem imply
that padding $\hat b_S$ with zeros on $\bar S$ obeys the KKT conditions for the full Lasso
problem and is, therefore, solution.

We need to justify (A.l) and (A.2). First, Lemmas A.6 and A.5 imply 
(A.l). Next, to show (A.2), we rewrite the left-hand side in (A.2) as 

XLXs{f3s - 6s) = XLXs{X'sXs)-\X^siy " ^sh) - X'sz). 

By Lemma A. 7, we have that 


X'siy - Xsbs) 



< \/fcA-|-||X 5 z|| < VkX+\/32k log{p/k) < C'^k logp 


holds with probability at least 1 — e — {\/2ek/p)^ —)■ 1. In addition. 
Lemma A. 11 with t = 1/2 gives 


I^s(^s^i 


N-ll 


< 


y/1 - Ijn - ^kX/n - 1/2 


< 3 


with probability at least 1 — e —)• 1. Hence, from the last two inequalities 

it follows that 


(A.3) 


Xs{X'sXs)-^{X's{y - Xsbs) - X'sz) 


< C"^Jklogp 


with probability at least 1 — e — {\/2ek/p)^ — e —> 1. Since XL is 
independent of Xs(X^Xs)~^(Xg(y — Xsbs) — X^z), Lemma A.6 gives 


XLXs{X'sXs)-HX's(.y - Xsbs) - X'sz] 


< 


2 logp 


n 


Xs{X'sXs)-Hx's{y - Xsbs) - X^jz) 


with probability approaching one. Combining this with (A.3) gives (A.2). 
Let $\tilde b_S$ be the solution to

$$\operatorname*{minimize}_{b \in \mathbb{R}^k}\ \tfrac12\|\beta_S + X_S' z - b\|^2 + \lambda\|b\|_1.$$

To complete the proof of (2.1), it suffices to establish (i) that for any constant
$\delta > 0$,

(A.4)  $$\sup_{\|\beta\|_0 \le k} P\left(\frac{\|\tilde b_S - \beta_S\|^2}{2(1 + c)^2 k \log p} > 1 - \delta\right) \to 1$$

and (ii)

(A.5)  $$\|\hat b_S - \tilde b_S\| = o_P\left(\|\tilde b_S - \beta_S\|\right),$$

since (A.4) and (A.5) give

(A.6)  $$\sup_{\|\beta\|_0 \le k} P\left(\frac{\|\hat b_S - \beta_S\|^2}{2(1 + c)^2 k \log p} > 1 - \delta\right) \to 1$$

for each $\delta > 0$. Note that taking $\delta = 1 - 1/(1 + c)^2$ in (A.6) and using the
fact that $\hat b_S$ is solution to Lasso with probability approaching one finish the
proof.

Proof of (A.4). Let $\beta_i = \infty$ if $i \in S$ and otherwise zero (treat $\infty$ as a
sufficiently large positive constant). For each $i \in S$, $\tilde b_{S,i} = \beta_i + X_i' z - \lambda$, and

$$|\tilde b_{S,i} - \beta_i| = |X_i' z - \lambda| \ge \lambda - |X_i' z|.$$

On the event $\{\max_{i \in S} |X_i' z| \le \lambda\}$, which happens with probability tending
to one, this inequality gives

$$\|\tilde b_S - \beta_S\|^2 \ge \sum_{i \in S} (\lambda - |X_i' z|)^2 = k\lambda^2 - 2\lambda \sum_{i \in S} |X_i' z| + \sum_{i \in S} (X_i' z)^2$$
$$= (1 + o_P(1))\, 2(1 + c)^2 k \log p,$$

where we have used that both $\sum_{i \in S} |X_i' z|$ and $\sum_{i \in S} (X_i' z)^2$ are $O_P(k)$. This
proves the claim.

Proof of (A.5). Apply Lemma 4.2 with $T$ replaced by $S$ (here each of
$\hat b_S$, $\tilde b_S$ and $\beta$ is supported on $S$). Since $k/p \to 0$, for any constant $\delta' > 0$, all
the singular values of $X_S$ lie in $(1 - \delta', 1 + \delta')$ with overwhelming probability
(see, for example, [56]). Consequently, Lemma 4.2 ensures (A.5).

□


Proof of (2.2). We assume cr = 1 and put A = As in the proof of 
Theorem 1.1, we decompose the total loss as 

l^se, - f3f = ||3se„5 - I3sf + - Psf = ll/3se„5 " l3sf + 11^., 5 H'' 

The largest possible value of the loss off support is achieved when is 
sequentially soft-thresholded by A“I^1. Hence, by the proof of Lemma 3.3, 
we obtain 

®^ll^seq,sll^ = o{2klog{p/k)) 


for all A:-sparse (3. 
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Now, we turn to consider the loss on support. For any i € S, the loss is 
at most 

(l^il + Xr{i)) = ^r(j) + 

Summing the above equalities over alH G S' gives 

k 

IE ||/3seq,5 — PsW'^ < ^ ^ zf + 2 ^ \zi\\r(iy 

i=l i&S i&S 


Note that the first term Af = (1 + o(l)) 2k\og{p/k), and the second 

term has expectation = ^ = o(2A:log(p/A:)), so that it suffices to 

show that 


(A.7) 


E 


2 ^ \ Zi\K(i) 

. ies 


o {2klog{p/k)). 


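We emphasize that both $z_i$ and $r(i)$ are random, so that $\{\lambda_{r(i)}\}_{i \in S}$ and $\{z_i\}_{i \in S}$ may not be independent. Without loss of generality, assume $S = \{1, \ldots, k\}$ and for $1 \le i \le k$, let $r'(i)$ be the rank of the $i$th observation among the first $k$. Since $\lambda$ is nonincreasing and $r'(i) \le r(i)$, we have

\[ \sum_{1 \le i \le k} |z_i| \lambda_{r(i)} \le \sum_{1 \le i \le k} |z_i| \lambda_{r'(i)} \le \sum_{1 \le i \le k} |z|_{(i)} \lambda_i, \]

where $|z|_{(1)} \ge \cdots \ge |z|_{(k)}$ are the order statistics of $z_1, \ldots, z_k$. The second inequality follows from the fact that for any nonnegative sequences $\{a_i\}$ and $\{b_i\}$, $\sum_i a_i b_i \le \sum_i a_{(i)} b_{(i)}$. Therefore, letting $\zeta_1, \ldots, \zeta_k$ be i.i.d. $\mathcal{N}(0,1)$, (A.7) follows from the estimate

\[ \sum_{i=1}^{k} \lambda_i\, \mathbb{E}|\zeta|_{(i)} = o(2k\log(p/k)). \tag{A.8} \]

To argue about (A.8), we work with the approximations $\lambda_i \sim \sqrt{2\log(p/i)}$ and $\mathbb{E}|\zeta|_{(i)} = O\big(\sqrt{2\log(2k/i)}\big)$ (see e.g. (A.15)), so that the claim is a consequence of

\[ \sum_{i=1}^{k} \sqrt{\log(p/i)\,\log(2k/i)} = o(2k\log(p/k)), \]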
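which is justified as follows:

\[ \sum_{i=1}^{k} \sqrt{\log\frac{p}{i}\,\log\frac{2k}{i}} \le k \int_0^1 \sqrt{\log\frac{p}{kx}\,\log\frac{2}{x}}\; dx \le k \int_0^1 \sqrt{\log\frac{2}{x}} \left( \sqrt{\log(p/k)} + \frac{\log(1/x)}{2\sqrt{\log(p/k)}} \right) dx = C_1 k \sqrt{\log(p/k)} + \frac{C_2 k}{\sqrt{\log(p/k)}} \]

for some absolute constants $C_1, C_2$. Since $\log(p/k) \to \infty$, it is clear that the right-hand side of the above display is $o(2k\log(p/k))$. □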


A.2. Proofs for Section 3. To begin with, we derive a dual formulation of the SLOPE program (1.6), which provides a nice geometrical interpretation. This dual formulation will also be used in the proof of Lemma 4.3. Our exposition largely borrows from [16].

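Rewrite (1.6) as

\[ \underset{b,\, r}{\text{minimize}} \quad \frac{1}{2}\|r\|^2 + \sum_i \lambda_i |b|_{(i)} \quad \text{subject to} \quad Xb + r = y, \tag{A.9} \]

whose Lagrangian is

\[ \mathcal{L}(b, r, \nu) := \frac{1}{2}\|r\|^2 + \sum_i \lambda_i |b|_{(i)} - \nu'(Xb + r - y). \]

Hence, the dual objective is given by

\[ \inf_{b,\,r} \mathcal{L}(b, r, \nu) = \nu'y - \sup_r \Big\{ \nu'r - \frac{1}{2}\|r\|^2 \Big\} - \sup_b \Big\{ (X'\nu)'b - \sum_i \lambda_i |b|_{(i)} \Big\} = \nu'y - \frac{1}{2}\|\nu\|^2 - \begin{cases} 0, & \nu \in C_{\lambda, X}, \\ +\infty, & \text{otherwise}, \end{cases} \]

where $C_{\lambda, X} := \{\nu : X'\nu \text{ is majorized by } \lambda\}$ is a (convex) polytope. It thus follows that the dual reads

\[ \underset{\nu}{\text{maximize}} \quad \nu'y - \frac{1}{2}\|\nu\|^2 \quad \text{subject to} \quad \nu \in C_{\lambda, X}. \tag{A.10} \]

The equality $\nu'y - \|\nu\|^2/2 = -\|y - \nu\|^2/2 + \|y\|^2/2$ reveals that the dual solution $\hat\nu$ is indeed the projection of $y$ onto $C_{\lambda, X}$. The minimization of the Lagrangian over $r$ is attained at $r = \nu$. This implies that the primal solution $\hat\beta$ and the dual solution $\hat\nu$ obey

\[ y - X\hat\beta = \hat\nu. \tag{A.11} \]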

We turn to proving the facts. 

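Proof of Fact 3.1. Without loss of generality, suppose both $a$ and $b$ are nonnegative and arranged in nonincreasing order. Denote by $T_k^a$ the sum of the first $k$ terms of $a$ with $T_0^a = 0$, and similarly for $b$. Since $a$ is majorized by $b$, we have $T_k^a \le T_k^b$ for every $k$, and therefore

\[ \|a\|^2 = \sum_{k=1}^{p} a_k (T_k^a - T_{k-1}^a) = \sum_{k=1}^{p-1} T_k^a (a_k - a_{k+1}) + a_p T_p^a \le \sum_{k=1}^{p-1} T_k^b (a_k - a_{k+1}) + a_p T_p^b = \sum_{k=1}^{p} a_k b_k. \]

Similarly,

\[ \|b\|^2 = \sum_{k=1}^{p-1} T_k^b (b_k - b_{k+1}) + b_p T_p^b \ge \sum_{k=1}^{p-1} T_k^a (b_k - b_{k+1}) + b_p T_p^a = \sum_{k=1}^{p} b_k (T_k^a - T_{k-1}^a) = \sum_{k=1}^{p} a_k b_k, \]

which proves the claim. □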
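The chain $\|a\|^2 \le \sum_k a_k b_k \le \|b\|^2$ can be checked on a small example. The sketch below is only an illustration: the helper `is_majorized` and the two vectors are hypothetical, and majorization is assumed to mean that every partial sum of the sorted magnitudes of $a$ is at most the corresponding partial sum for $b$.

```python
from itertools import accumulate

def is_majorized(a, b, tol=1e-12):
    """True if each partial sum of sorted |a| is at most that of sorted |b| (assumed definition)."""
    sa = sorted((abs(x) for x in a), reverse=True)
    sb = sorted((abs(x) for x in b), reverse=True)
    return all(s <= t + tol for s, t in zip(accumulate(sa), accumulate(sb)))

# Hypothetical vectors, nonnegative and nonincreasing as in the proof.
# Note that a is not componentwise below b (3.5 > 3.0), yet it is majorized by b.
a = [4.0, 3.5, 2.0, 0.5]
b = [5.0, 3.0, 2.0, 1.0]

norm_a_sq = sum(x * x for x in a)
inner_ab = sum(x * y for x, y in zip(a, b))
norm_b_sq = sum(y * y for y in b)

print(is_majorized(a, b))               # True
print(norm_a_sq, inner_ab, norm_b_sq)   # 32.5 35.0 39.0
```

The middle quantity sits between the two squared norms, as the summation-by-parts argument predicts.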

Proofs of Facts 3.2 and 3.3. Taking $X = I_p$ in the dual formulation, (A.11) immediately implies that $a - \mathrm{prox}_\lambda(a)$ is the projection of $a$ onto the polytope $C_{\lambda, I_p}$. By definition, $C_{\lambda, I_p}$ consists of all vectors majorized by $\lambda$. Hence, $a - \mathrm{prox}_\lambda(a)$ is always majorized by $\lambda$. In particular, if $a$ is majorized by $\lambda$, then the projection $a - \mathrm{prox}_\lambda(a)$ of $a$ is identical to $a$ itself. This gives $\mathrm{prox}_\lambda(a) = 0$. □
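Both facts can be observed numerically. The following is a minimal sketch, not the implementation from [16]: it computes the prox of the sorted $\ell_1$ norm by sorting, subtracting $\lambda$, pooling adjacent violators and clipping at zero, which is assumed here to be equivalent to the FastProxSL1 procedure. The function names and test vectors are hypothetical.

```python
from itertools import accumulate

def prox_sorted_l1(a, lam):
    """Prox of b -> sum_i lam[i] * |b|_(i) at a; lam is assumed nonnegative and nonincreasing."""
    order = sorted(range(len(a)), key=lambda i: -abs(a[i]))
    blocks = []  # each block stores [sum, count] of pooled values |a|_(i) - lam_i
    for rank, idx in enumerate(order):
        blocks.append([abs(a[idx]) - lam[rank], 1])
        # Pool adjacent blocks while the previous average does not exceed the current one.
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] <= blocks[-1][0] * blocks[-2][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = [max(s / c, 0.0) for s, c in blocks for _ in range(c)]
    out = [0.0] * len(a)
    for rank, idx in enumerate(order):
        out[idx] = fitted[rank] if a[idx] >= 0 else -fitted[rank]
    return out

def is_majorized(a, b, tol=1e-12):
    """True if each partial sum of sorted |a| is at most that of sorted |b| (assumed definition)."""
    sa = sorted((abs(x) for x in a), reverse=True)
    sb = sorted((abs(x) for x in b), reverse=True)
    return all(s <= t + tol for s, t in zip(accumulate(sa), accumulate(sb)))

lam = [3.0, 2.0, 1.0]

a1 = [5.0, 1.0, -4.0]
p1 = prox_sorted_l1(a1, lam)
residual = [x - y for x, y in zip(a1, p1)]
print(p1)                            # [2.0, 0.0, -2.0]
print(is_majorized(residual, lam))   # True: a - prox(a) lies in the polytope

a2 = [1.0, -2.5, 0.5]                # already majorized by lam
p2 = prox_sorted_l1(a2, lam)
print(all(v == 0 for v in p2))       # True: the prox vanishes
```

In the first example the residual $a - \mathrm{prox}_\lambda(a)$ has sorted magnitudes $(3, 2, 1)$, so it sits on the boundary of the polytope.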

Proof of Fact 3.4. Assume $a$ is nonnegative without loss of generality. It is intuitively obvious that

\[ b \ge a \implies \mathrm{prox}_\lambda(b) \ge \mathrm{prox}_\lambda(a), \]

where as usual $b \ge a$ means that $b - a \in \mathbb{R}^p_+$. In other words, if the observations increase, the fitted values do not decrease. To save time, we directly verify this claim by using Algorithm 3 (FastProxSL1) from [16]. By the averaging step of that algorithm, we can see that for each $1 \le i, j \le p$,

\[ \frac{\partial\, [\mathrm{prox}_\lambda(a)]_i}{\partial a_j} = \begin{cases} \dfrac{1}{\#\{1 \le k \le p : [\mathrm{prox}_\lambda(a)]_k = [\mathrm{prox}_\lambda(a)]_j\}}, & [\mathrm{prox}_\lambda(a)]_j = [\mathrm{prox}_\lambda(a)]_i > 0, \\ 0, & \text{otherwise}. \end{cases} \]

This holds for all $a \in \mathbb{R}^p_+$ except for a set of measure zero. The nonnegativity of $\partial\, [\mathrm{prox}_\lambda(a)]_i / \partial a_j$ along with the Lipschitz continuity of the prox imply the monotonicity property. A consequence is that $\|[\mathrm{prox}_\lambda(a)]_{\overline T}\|$ does not decrease as we let $a_i \to \infty$ for all $i \in T$. In the limit, $\|[\mathrm{prox}_\lambda(a)]_{\overline T}\|$ monotonically converges to $\|\mathrm{prox}_{\lambda^{-[|T|]}}(a_{\overline T})\|$. This gives the desired inequality.

□


As a remark, we point out that the proofs of Facts 3.2 and 3.3 suggest a very simple proof of Lemma 3.1. Since $a - \mathrm{prox}_\lambda(a)$ is the projection of $a$ onto $C_{\lambda, I_p}$, $\|\mathrm{prox}_\lambda(a)\|$ is thus the distance between $a$ and the polytope $C_{\lambda, I_p}$. Hence, it suffices to find a point in the polytope at a distance of $\|(|a| - \lambda)_+\|$ away from $a$. The point $b$ defined as $b_i = \min\{|a_i|, \lambda_i\}$ does the job.
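The remark can be illustrated on a concrete vector. The sketch below takes a hypothetical nonnegative $a$ (so that no sign bookkeeping is needed), builds $b_i = \min\{|a_i|, \lambda_i\}$, and checks that $b$ lies in the polytope and sits at distance $\|(|a| - \lambda)_+\|$ from $a$. Majorization is again assumed to mean domination of the partial sums of sorted magnitudes.

```python
import math
from itertools import accumulate

lam = [3.0, 2.0, 1.0]   # nonincreasing weights (illustrative values)
a = [5.0, 1.5, 2.5]     # nonnegative input (hypothetical)

# The candidate point from the remark.
b = [min(abs(ai), li) for ai, li in zip(a, lam)]

# b is majorized by lam: compare partial sums of the sorted magnitudes.
sorted_b = sorted(b, reverse=True)
in_polytope = all(s <= t for s, t in zip(accumulate(sorted_b), accumulate(lam)))

dist = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
target = math.sqrt(sum(max(abs(ai) - li, 0.0) ** 2 for ai, li in zip(a, lam)))

print(in_polytope)    # True
print(dist, target)   # 2.5 2.5
```

Because the projection is the closest point of the polytope, the distance from $a$ to $b$ is an upper bound on $\|\mathrm{prox}_\lambda(a)\|$.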

Now, we proceed to prove the preparatory lemmas for Theorem 1.1, namely, Lemmas A.3 and A.4. The first two lemmas below can be found in [3].

Lemma A.1. Let $U$ be a $\mathrm{Beta}(a, b)$ random variable. Then

\[ \mathbb{E} \log U = (\log \Gamma(a))' - (\log \Gamma(a + b))', \]

where $\Gamma$ denotes the Gamma function and $(\log \Gamma(x))'$ is the derivative with respect to $x$.


Lemma A.2. For any integer $m \ge 1$,

\[ (\log \Gamma(m))' = -\gamma + \sum_{j=1}^{m-1} \frac{1}{j} = \log m + O\Big(\frac{1}{m}\Big), \]

where $\gamma = 0.577215\cdots$ is the Euler constant.
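For integer parameters, the two lemmas combine into a closed form that is easy to compare against simulation. The sketch below is illustrative only; the function names, the parameter choice $(a, b) = (3, 8)$ and the sample size are assumptions made for the example.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def digamma_int(m):
    """(log Gamma)'(m) for an integer m >= 1, using the harmonic-sum formula of Lemma A.2."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, m))

def expected_log_beta(a, b):
    """E log U for U ~ Beta(a, b) with integer a, b, using Lemma A.1."""
    return digamma_int(a) - digamma_int(a + b)

# Monte Carlo estimate of E log U with a fixed seed.
rng = random.Random(0)
a, b, reps = 3, 8, 100_000
mc_estimate = sum(math.log(rng.betavariate(a, b)) for _ in range(reps)) / reps
exact = expected_log_beta(a, b)   # equals -(1/3 + 1/4 + ... + 1/10)

print(exact, mc_estimate)
print(digamma_int(1000) - math.log(1000))   # small, of order 1/m
```

The same identity with $a = i$ and $b = p + 1 - i$ is what produces the harmonic sum $\sum_{j=i}^{p} 1/j$ in the proof of Lemma A.3.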


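Lemma A.3. Let $\zeta \sim \mathcal{N}(0, I_{p-k})$. Under the assumptions of Theorem 1.1, for any constant $A > 0$,

\[ \frac{1}{2k\log(p/k)} \sum_{i=1}^{\lfloor Ak \rfloor} \mathbb{E}\big( |\zeta|_{(i)} - \lambda^{\mathrm{BH}}_{k+i} \big)_+^2 \to 0. \]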

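Proof of Lemma A.3. Write $\lambda_i = \lambda_i^{\mathrm{BH}}$ for simplicity. It is sufficient to prove a stronger version in which the order statistics $|\zeta|_{(i)}$ come from $p$ i.i.d. $\mathcal{N}(0,1)$. The reason is that the order statistics will be stochastically larger, thus enlarging $\mathbb{E}(|\zeta|_{(i)} - \lambda_{k+i})_+^2$, since $(\zeta - \lambda)_+^2$ is nondecreasing in $\zeta$. Applying the bias-variance decomposition, we get

\[ \mathbb{E}\big(|\zeta|_{(i)} - \lambda_{k+i}\big)_+^2 \le \mathbb{E}\big(|\zeta|_{(i)} - \lambda_{k+i}\big)^2 = \mathrm{Var}(|\zeta|_{(i)}) + \big(\mathbb{E}|\zeta|_{(i)} - \lambda_{k+i}\big)^2. \tag{A.12} \]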
We proceed to control each term separately.

For the variance, a direct application of Proposition 4.2 in [19] gives

\[ \mathrm{Var}(|\zeta|_{(i)}) = O\Big( \frac{1}{i \log(p/i)} \Big) \tag{A.13} \]

for all $i \le p/2$. Hence,

\[ \sum_{i=1}^{\lfloor Ak \rfloor} \mathrm{Var}(|\zeta|_{(i)}) = O\Big( \sum_{i=1}^{\lfloor Ak \rfloor} \frac{1}{i \log(p/i)} \Big) = o(2k\log(p/k)), \]

where the last step makes use of $\log(p/k) \to \infty$. It remains to show that

\[ \sum_{i=1}^{\lfloor Ak \rfloor} \big( \mathbb{E}|\zeta|_{(i)} - \lambda_{k+i} \big)^2 = o(2k\log(p/k)). \tag{A.14} \]

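Let $U_1, \ldots, U_p$ be i.i.d. uniform random variables on $(0,1)$ and $U_{(i)}$ be the $i$th smallest (please note that, for a change, the $U_i$'s are sorted in increasing order). We know that $U_{(i)}$ is distributed as $\mathrm{Beta}(i, p + 1 - i)$ and that $|\zeta|_{(i)}$ has the same distribution as $\Phi^{-1}(1 - U_{(i)}/2)$. Making use of Lemmas A.1 and A.2 then gives

\[ \mathbb{E}|\zeta|_{(i)}^2 = \mathbb{E}\big[ \Phi^{-1}(1 - U_{(i)}/2)^2 \big] \sim \mathbb{E}\big[ 2\log(2/U_{(i)}) \big] = 2\log 2 + 2\sum_{j=i}^{p} \frac{1}{j} = (1 + o(1))\, 2\log(p/i), \]

where the second step follows from $(1 + o_P(1))\, 2\log(2/U_{(i)}) \le \Phi^{-1}(1 - U_{(i)}/2)^2 \le 2\log(2/U_{(i)})$ for $i = o(p)$. As a result,

\[ \mathbb{E}|\zeta|_{(i)} \le \sqrt{\mathbb{E}|\zeta|_{(i)}^2} = (1 + o(1))\sqrt{2\log(p/i)}, \qquad \mathbb{E}|\zeta|_{(i)} = \sqrt{\mathbb{E}|\zeta|_{(i)}^2 - \mathrm{Var}(|\zeta|_{(i)})} = (1 + o(1))\sqrt{2\log(p/i)}. \tag{A.15} \]

Similarly, since $k + i = o(p)$ and $q$ is constant, we have the approximation

\[ \lambda_{k+i} = (1 + o(1))\sqrt{2\log(p/(k+i))}, \]

which together with (A.15) reveals that

\[ \big( \mathbb{E}|\zeta|_{(i)} - \lambda_{k+i} \big)^2 \le (1 + o(1))\, 2 \Big[ \sqrt{\log(p/i)} - \sqrt{\log(p/(k+i))} \Big]^2 + o(1)\log(p/i). \tag{A.16} \]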
The second term in the right-hand side contributes at most $o(1)\, Ak\log(p/(Ak)) = o(1)\, 2k\log(p/k)$ in the sum (A.14). For the first term, we get

\[ \Big[ \sqrt{\log(p/i)} - \sqrt{\log(p/(k+i))} \Big]^2 = \frac{\log^2(1 + k/i)}{\Big[ \sqrt{\log(p/i)} + \sqrt{\log(p/(k+i))} \Big]^2} = o(1)\log^2(1 + k/i). \]

Hence, it contributes at most

\[ o(1) \sum_{i=1}^{\lfloor Ak \rfloor} \log^2(1 + k/i) \le o(1) \sum_{i=1}^{\lfloor Ak \rfloor} k \int_{(i-1)/k}^{i/k} \log^2(1 + 1/x)\, dx \le o(1)\, k \int_0^{A} \log^2(1 + 1/x)\, dx = o(2k\log(p/k)). \tag{A.17} \]

Combining (A.17), (A.16) and (A.14) concludes the proof.

□


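Lemma A.4. Let $\zeta \sim \mathcal{N}(0, I_{p-k})$ and $A > 0$ be any constant satisfying $q(1 + A)/A < 1$. Then, under the assumptions of Theorem 1.1,

\[ \frac{1}{2k\log(p/k)} \sum_{i=\lceil Ak \rceil}^{p-k} \mathbb{E}\big( |\zeta|_{(i)} - \lambda^{\mathrm{BH}}_{k+i} \big)_+^2 \to 0. \]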


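Proof of Lemma A.4. Again, write $\lambda_i = \lambda_i^{\mathrm{BH}}$ for simplicity. As in the proof of Lemma A.3, we work on a stronger version by assuming $\zeta \sim \mathcal{N}(0, I_p)$. Denote by $q' = q(1 + A)/A$. For any $u \ge 0$, let $\alpha_u := \mathbb{P}(|\mathcal{N}(0,1)| > \lambda_{k+i} + u) = 2\Phi(-\lambda_{k+i} - u)$. Then $\mathbb{P}(|\zeta|_{(i)} > \lambda_{k+i} + u)$ is just the tail probability of the binomial distribution with $p$ trials and success probability $\alpha_u$. By the Chernoff bound, this probability is bounded as

\[ \mathbb{P}(|\zeta|_{(i)} > \lambda_{k+i} + u) \le e^{-p\, \mathrm{KL}(i/p \,\|\, \alpha_u)}, \tag{A.18} \]

where $\mathrm{KL}(a \| b) := a\log\frac{a}{b} + (1 - a)\log\frac{1-a}{1-b}$ is the Kullback-Leibler divergence.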
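The Chernoff bound used in (A.18) can be compared with an exact binomial tail. The sketch below is a numerical illustration only; the values of `p`, `alpha` and `i` are hypothetical and chosen so that the success probability is below `i / p`, which is the regime in which the bound applies.

```python
import math

def kl(a, b):
    """KL(a || b) between Bernoulli(a) and Bernoulli(b), for 0 < a, b < 1."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def binom_upper_tail(p, alpha, i):
    """P(Binomial(p, alpha) >= i), summed exactly term by term."""
    return sum(math.comb(p, j) * alpha**j * (1 - alpha) ** (p - j) for j in range(i, p + 1))

# Hypothetical numbers: p trials, success probability alpha < i / p.
p, alpha, i = 200, 0.02, 10
tail = binom_upper_tail(p, alpha, i)
chernoff = math.exp(-p * kl(i / p, alpha))

print(tail, chernoff)
```

The event that at least `i` of the `p` variables exceed the threshold is the same as the event that the `i`th largest exceeds it, which is how the binomial tail enters the proof.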
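Note that

\[ \frac{\partial\, \mathrm{KL}(i/p \,\|\, b)}{\partial b} = -\frac{i/p}{b} + \frac{1 - i/p}{1 - b} \le -\frac{i}{pb} + 1 \tag{A.19} \]

for all $0 < b < i/p$. Hence, from (A.19) it follows that

\[ \mathrm{KL}(i/p \,\|\, \alpha_u) - \mathrm{KL}(i/p \,\|\, \alpha_0) = -\int_{\alpha_u}^{\alpha_0} \frac{\partial\, \mathrm{KL}}{\partial b}\, db \ge \int_{\alpha_u}^{\alpha_0} \Big( \frac{i}{pb} - 1 \Big) db \ge \int_{e^{-u\lambda_{k+i}}\alpha_0}^{\alpha_0} \Big( \frac{i}{pb} - 1 \Big) db = \frac{i u \lambda_{k+i}}{p} - \alpha_0\big( 1 - e^{-u\lambda_{k+i}} \big), \tag{A.20} \]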
where the second inequality makes use of $\alpha_u/\alpha_0 \le e^{-u\lambda_{k+i}}$. With the proviso that $q(1 + A)/A < 1$ and $i \ge Ak$, it follows that

\[ \alpha_0 = q(k + i)/p \le q'i/p. \tag{A.21} \]

Hence, substituting (A.21) into (A.20), we see that (A.18) yields

\[ \mathbb{P}(|\zeta|_{(i)} > \lambda_{k+i} + u) \le e^{-p\left( \mathrm{KL}(i/p \,\|\, \alpha_u) - \mathrm{KL}(i/p \,\|\, \alpha_0) \right)} \le \exp\Big( -iu\lambda_{k+i} + q'i\big( 1 - \exp(-u\lambda_{k+i}) \big) \Big). \tag{A.22} \]
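With this preparation, we conclude the proof of our lemma as follows:

\[ \mathbb{E}\big( |\zeta|_{(i)} - \lambda_{k+i} \big)_+^2 = \int_0^\infty \mathbb{P}\Big( \big( |\zeta|_{(i)} - \lambda_{k+i} \big)_+^2 > x \Big) dx = \int_0^\infty \mathbb{P}\big( |\zeta|_{(i)} > \lambda_{k+i} + \sqrt{x} \big) dx = 2\int_0^\infty u\, \mathbb{P}\big( |\zeta|_{(i)} > \lambda_{k+i} + u \big) du, \]

and plugging (A.22) gives

\[ \mathbb{E}\big( |\zeta|_{(i)} - \lambda_{k+i} \big)_+^2 \le 2\int_0^\infty u \exp\Big( -iu\lambda_{k+i} + q'i\big( 1 - \exp(-u\lambda_{k+i}) \big) \Big) du = \frac{2}{\lambda_{k+i}^2} \int_0^\infty x\, e^{-i\left( x - q'(1 - e^{-x}) \right)} dx \le \frac{2}{\lambda_p^2} \int_0^\infty x\, e^{-i\left( x - q'(1 - e^{-x}) \right)} dx. \]

This yields the upper bound

\[ \sum_{i=\lceil Ak \rceil}^{p-k} \mathbb{E}\big( |\zeta|_{(i)} - \lambda_{k+i} \big)_+^2 \le \frac{2}{\lambda_p^2} \sum_{i=\lceil Ak \rceil}^{p-k} \int_0^\infty x\, e^{-i\left( x - q'(1 - e^{-x}) \right)} dx \le \frac{2}{\Phi^{-1}(1 - q/2)^2} \int_0^\infty \frac{x\, e^{-\left( x - q'(1 - e^{-x}) \right)}}{1 - e^{-\left( x - q'(1 - e^{-x}) \right)}}\, dx. \]

Since the integrand obeys

\[ \lim_{x \to 0} \frac{x\, e^{-\left( x - q'(1 - e^{-x}) \right)}}{1 - e^{-\left( x - q'(1 - e^{-x}) \right)}} = \frac{1}{1 - q'} \]

and decays exponentially fast as $x \to \infty$, we conclude that $\sum_{i=\lceil Ak \rceil}^{p-k} \mathbb{E}\big( |\zeta|_{(i)} - \lambda_{k+i} \big)_+^2$ is bounded by a constant. This is a bit more than we need since $2k\log(p/k) \to \infty$. □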

A.3. Proofs for Section 4. In this paper, we often use the Borell inequality to show that $\mathbb{P}(\|\mathcal{N}(0, I_n)\| > \sqrt{n} + t) \le \exp(-t^2/2)$.

Lemma A.5 (Borell's inequality). Let $\zeta \sim \mathcal{N}(0, I_n)$ and $f$ be an $L$-Lipschitz continuous function in $\mathbb{R}^n$. Then

\[ \mathbb{P}\big( f(\zeta) > \mathbb{E} f(\zeta) + t \big) \le e^{-\frac{t^2}{2L^2}} \]

for every $t > 0$.
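The norm bound stated at the start of this subsection follows by taking $f = \|\cdot\|$, which is 1-Lipschitz and has $\mathbb{E}\|\zeta\| \le \sqrt{n}$. A small simulation makes the inequality concrete. This is only an illustrative sketch; the dimension, the threshold `t`, the seed and the number of trials are arbitrary choices.

```python
import math
import random

rng = random.Random(1)
n, t, trials = 20, 1.0, 5000

def gaussian_norm(dim):
    """Euclidean norm of a standard Gaussian vector in R^dim."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))

# Empirical frequency of the event ||N(0, I_n)|| > sqrt(n) + t.
exceed_freq = sum(gaussian_norm(n) > math.sqrt(n) + t for _ in range(trials)) / trials
borell_bound = math.exp(-t * t / 2)

print(exceed_freq, borell_bound)
```

The empirical frequency is typically far below the bound, which is expected since Borell's inequality is dimension-free and not tight for a fixed `n`.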

Lemma A.6. Let $\zeta_1, \ldots, \zeta_p$ be i.i.d. $\mathcal{N}(0,1)$. Then

\[ \max_i |\zeta_i| \le \sqrt{2\log p} \]

holds with probability approaching one.

The latter classical result can be proved in many different ways. Suffice it to say that it follows from a more subtle fact, namely, that

\[ \sqrt{2\log p}\, \Big( \max_i \zeta_i - \sqrt{2\log p} + \frac{\log\log p + \log 4\pi}{2\sqrt{2\log p}} \Big) \]

converges weakly to a Gumbel distribution [26].

Proof of Lemma 4.3. Let $\tilde b$ be the lift of $\hat b_T$ in the sense that $\tilde b_T = \hat b_T$ and $\tilde b_{\overline T} = 0$, and let $|T| = m$. Further, set $\nu := y - X_T \hat b_T = y - X\tilde b$. Applying (A.10) and (A.11) to the reduced SLOPE program, we get that

\[ X_T'\nu \preceq \lambda^{[m]}. \]

By the assumption, $X_{\overline T}'\nu$ is majorized by $\lambda^{-[m]}$. Hence, $X'\nu$ (the concatenation of $X_T'\nu$ and $X_{\overline T}'\nu$) is majorized by $\lambda = (\lambda^{[m]}, \lambda^{-[m]})$. This confirms that $\nu$ is dual feasible with respect to the full SLOPE program. If additionally we show that

\[ \frac{1}{2}\|y - X\tilde b\|^2 + \sum_i \lambda_i |\tilde b|_{(i)} = \nu'y - \frac{1}{2}\|\nu\|^2, \tag{A.23} \]

then the strong duality claims that $\tilde b$ and $\nu$ must, respectively, be the optimal solutions to the full primal and dual.

In fact, (A.23) is self-evident. The right-hand side is the optimal value of the reduced dual (i.e., replacing $X$ and $\lambda$ by $X_T$ and $\lambda^{[m]}$ in (A.10)), while the left-hand side agrees with the optimal value of the reduced primal since

\[ \frac{1}{2}\|y - X\tilde b\|^2 = \frac{1}{2}\|y - X_T \hat b_T\|^2 \quad \text{and} \quad \sum_{i=1}^{p} \lambda_i |\tilde b|_{(i)} = \sum_{i=1}^{m} \lambda_i |\hat b_T|_{(i)}. \]

Since the reduced primal only has linear equality constraints and is clearly feasible, strong duality holds, and (A.23) follows from this. □


Lemma A.7. Let $1 \le k^* < p$ be any (deterministic) integer. Then

\[ \sup_{|T| = k^*} \|X_T'z\| \le \sqrt{32 k^* \log(p/k^*)} \]

with probability at least $1 - e^{-n/2} - (\sqrt{2}ek^*/p)^{k^*}$. Above, the supremum is taken over all the subsets of $\{1, \ldots, p\}$ with cardinality $k^*$.


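Proof of Lemma A.7. Conditional on $z$, it is easy to see that the entries of $X'z$ are distributed as i.i.d. centered Gaussian random variables with variance $\|z\|^2/n$. This observation enables us to write

\[ X'z \stackrel{d}{=} \frac{\|z\|}{\sqrt{n}} (\zeta_1, \ldots, \zeta_p), \]

where $\zeta := (\zeta_1, \ldots, \zeta_p)$ consists of i.i.d. $\mathcal{N}(0,1)$ independent of $\|z\|$. Hence, it is sufficient to prove that

\[ \|z\| \le 2\sqrt{n}, \qquad |\zeta|_{(1)}^2 + \cdots + |\zeta|_{(k^*)}^2 \le 8k^*\log(p/k^*) \]

simultaneously with probability at least $1 - e^{-n/2} - (\sqrt{2}ek^*/p)^{k^*}$. From Lemma A.5, we know that $\mathbb{P}(\|z\| > 2\sqrt{n}) \le e^{-n/2}$, so we just need to establish the other inequality. To this end, observe that

\[ \mathbb{P}\Big( |\zeta|_{(1)}^2 + \cdots + |\zeta|_{(k^*)}^2 > 8k^*\log(p/k^*) \Big) \le \frac{\mathbb{E}\, e^{\frac{1}{4}\left( |\zeta|_{(1)}^2 + \cdots + |\zeta|_{(k^*)}^2 \right)}}{e^{2k^*\log(p/k^*)}} \le \frac{\sum_{i_1 < \cdots < i_{k^*}} \mathbb{E}\, e^{\frac{1}{4}\left( \zeta_{i_1}^2 + \cdots + \zeta_{i_{k^*}}^2 \right)}}{e^{2k^*\log(p/k^*)}} = \frac{\binom{p}{k^*} 2^{k^*/2}}{e^{2k^*\log(p/k^*)}} \le \Big( \frac{\sqrt{2}ek^*}{p} \Big)^{k^*}. \]

□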
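The last step of the chain above uses $\mathbb{E}\, e^{\zeta^2/4} = \sqrt{2}$ and $\binom{p}{k^*} \le (ep/k^*)^{k^*}$. The sketch below checks the final inequality on the log scale for a few hypothetical pairs $(p, k^*)$; the function names are illustrative.

```python
import math

def log_union_bound(p, k):
    """log of C(p, k) * 2^(k/2) * exp(-2 k log(p/k)), the penultimate term of the chain."""
    log_binom = math.lgamma(p + 1) - math.lgamma(k + 1) - math.lgamma(p - k + 1)
    return log_binom + 0.5 * k * math.log(2) - 2 * k * math.log(p / k)

def log_final_bound(p, k):
    """log of (sqrt(2) * e * k / p)^k, the last term of the chain."""
    return k * (0.5 * math.log(2) + 1 + math.log(k / p))

# Hypothetical (p, k*) pairs.
pairs = [(1000, 10), (10_000, 50), (100_000, 1000)]
for p, k in pairs:
    print(p, k, log_union_bound(p, k), log_final_bound(p, k))
```

Working with logarithms avoids overflow in the binomial coefficient and underflow in the final bound when `p` is large.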

We record an elementary result which simply follows from $\Phi^{-1}(1 - c/2) \le \sqrt{2\log(1/c)}$ for each $0 < c < 1$.

Lemma A.8. Fix $0 < q < 1$. Then for all $1 \le k \le p/2$,

\[ \sum_{i=1}^{k} (\lambda_i^{\mathrm{BH}})^2 \le C_q \cdot k\log(p/k), \]

for some constant $C_q > 0$.
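To make Lemma A.8 concrete, one can compute the BHq critical values $\lambda_i^{\mathrm{BH}} = \Phi^{-1}(1 - iq/(2p))$ and inspect the ratio of the two sides. The sketch below is illustrative: the values of `p` and `q` and the helper name are assumptions, and the normal quantile comes from Python's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist

def bh_critical_values(p, q):
    """lambda_i = Phi^{-1}(1 - i q / (2 p)) for i = 1, ..., p."""
    inv_cdf = NormalDist().inv_cdf
    return [inv_cdf(1 - i * q / (2 * p)) for i in range(1, p + 1)]

p, q = 1000, 0.1   # hypothetical problem size and level
lam = bh_critical_values(p, q)

# Elementary bound with c = i q / p: lambda_i <= sqrt(2 log(p / (i q))).
elementary_ok = all(
    lam[i - 1] <= math.sqrt(2 * math.log(p / (i * q))) + 1e-9 for i in range(1, p + 1)
)

# Ratio of sum_{i <= k} lambda_i^2 to k log(p / k) for several k <= p / 2.
ratios = {k: sum(v * v for v in lam[:k]) / (k * math.log(p / k)) for k in (1, 10, 100, 500)}

print(elementary_ok)
print(ratios)
```

The ratios stay bounded as `k` ranges up to `p / 2`, which is the content of the lemma for this particular `q`.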

In the next two lemmas, we use the BHq critical values $\lambda^{\mathrm{BH}}$ to majorize sequences of Gaussian order statistics. Again, $a \preceq b$ means that $b$ majorizes $a$.

Lemma A.9. Given any constant $c > 1/(1 - q)$, suppose $\max\{ck, k + d\} \le k^* < p$ for any (deterministic) sequence $d$ that diverges to $\infty$. Let $\zeta_1, \ldots, \zeta_{p-k}$ be i.i.d. $\mathcal{N}(0,1)$. Then

\[ \big( |\zeta|_{(k^*-k+1)}, |\zeta|_{(k^*-k+2)}, \ldots, |\zeta|_{(p-k)} \big) \preceq \big( \lambda^{\mathrm{BH}}_{k^*+1}, \lambda^{\mathrm{BH}}_{k^*+2}, \ldots, \lambda^{\mathrm{BH}}_{p} \big) \]

with probability approaching one.

Proof of Lemma A.9. It suffices to prove the stronger case where $\zeta \sim \mathcal{N}(0, I_p)$. Let $U_1, \ldots, U_p$ be i.i.d. uniform random variables on $[0,1]$ and $U_{(1)} \le \cdots \le U_{(p)}$ the corresponding order statistics. Since

\[ \big( |\zeta|_{(k^*-k+1)}, \ldots, |\zeta|_{(p-k)} \big) \stackrel{d}{=} \big( \Phi^{-1}(1 - U_{(k^*-k+1)}/2), \ldots, \Phi^{-1}(1 - U_{(p-k)}/2) \big), \]

the conclusion would follow from

\[ \mathbb{P}\big( U_{(k^*-k+j)} \ge q(k^* + j)/p, \ \forall j \in \{1, \ldots, p - k^*\} \big) \to 1. \]

Let $E_1, \ldots, E_{p+1}$ be i.i.d. exponential random variables with mean 1 and denote by $T_i = E_1 + \cdots + E_i$. Then the order statistics $U_{(i)}$ have the same joint distribution as $T_i/T_{p+1}$. Fixing an arbitrary constant $q' \in (q, 1 - 1/c)$, we have

\[ \mathbb{P}\big( U_{(k^*-k+j)} \ge q(k^* + j)/p, \ \forall j \big) \ge \mathbb{P}\big( T_{k^*-k+j} \ge q'(k^* + j), \ \forall j \big) - \mathbb{P}\big( T_{p+1} > q'p/q \big). \]

Since $\mathbb{P}(T_{p+1} > q'p/q) \to 0$ by the law of large numbers, it is sufficient to prove

\[ \mathbb{P}\big( T_{k^*-k+j} \ge q'(k^* + j), \ \forall j \in \{1, \ldots, p - k^*\} \big) \to 1. \tag{A.24} \]
This event can be rewritten as

\[ T_{k^*-k+j} - T_{k^*-k} - q'j \ge q'k^* - T_{k^*-k} \]

for all $1 \le j \le p - k^*$. Hence, (A.24) is reduced to proving

\[ \mathbb{P}\Big( \min_{1 \le j \le p - k^*} \big\{ T_{k^*-k+j} - T_{k^*-k} - q'j \big\} \ge q'k^* - T_{k^*-k} \Big) \to 1. \tag{A.25} \]

As a random walk, $T_{k^*-k+j} - T_{k^*-k} - q'j$ has i.i.d. increments with mean $1 - q' > 0$ and variance 1. Thus $\min_{1 \le j \le p - k^*} \{ T_{k^*-k+j} - T_{k^*-k} - q'j \}$ converges weakly to a bounded random variable in distribution. Consequently, (A.25) holds if one can demonstrate that $q'k^* - T_{k^*-k}$ diverges to $-\infty$ as $p \to \infty$ in probability. To see this, observe that

\[ q'k^* - T_{k^*-k} = \frac{q'k^*}{k^* - k}(k^* - k) - T_{k^*-k} \le \frac{q'c}{c - 1}(k^* - k) - T_{k^*-k}, \]

where we use the fact $k^* \ge ck$. Under our hypothesis $q'c/(c - 1) < 1$, the process $\{ q'ct/(c - 1) - T_t : t \in \mathbb{N} \}$ is a random walk drifting towards $-\infty$. Recognizing that $k^* - k \ge d \to \infty$, we see that $q'c(k^* - k)/(c - 1) - T_{k^*-k}$ (weakly) diverges to $-\infty$ since it corresponds to a position of the preceding random walk at $t \to \infty$. This concludes the proof.

□
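The proof above relies on the classical representation of uniform order statistics as ratios of partial sums of exponentials, $U_{(i)} \stackrel{d}{=} T_i/T_{p+1}$. The following sketch draws from this representation and compares the empirical mean of $U_{(i)}$ with the known value $i/(p+1)$ for a $\mathrm{Beta}(i, p+1-i)$ variable. The function name, sizes and seed are illustrative assumptions.

```python
import random

def uniform_order_stats_via_exponentials(p, rng):
    """Draw (U_(1), ..., U_(p)) as T_i / T_{p+1}, where T_i are partial sums of Exp(1) variables."""
    partial, sums = 0.0, []
    for _ in range(p + 1):
        partial += rng.expovariate(1.0)
        sums.append(partial)
    return [t / sums[-1] for t in sums[:-1]]

rng = random.Random(0)
p, i, reps = 50, 5, 4000
draws = [uniform_order_stats_via_exponentials(p, rng) for _ in range(reps)]

# The i-th smallest uniform is Beta(i, p + 1 - i), whose mean is i / (p + 1).
mean_ui = sum(d[i - 1] for d in draws) / reps
print(mean_ui, i / (p + 1))
```

The representation turns statements about order statistics into statements about a random walk with i.i.d. increments, which is what makes the drift argument in the proof possible.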

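Lemma A.10. Let $\zeta_1, \ldots, \zeta_{p-k}$ be i.i.d. $\mathcal{N}(0,1)$. Then there exists a constant $C_q$ only depending on $q$ such that

\[ (\zeta_1, \ldots, \zeta_{p-k}) \preceq C_q \sqrt{\frac{\log p}{\log(p/k)}}\, \big( \lambda^{\mathrm{BH}}_{k+1}, \ldots, \lambda^{\mathrm{BH}}_{p} \big) \]

with probability tending to one as $p \to \infty$ and $k/p \to 0$.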

Proof of Lemma A.10. Let $U_1, \ldots, U_{p-k}$ be i.i.d. uniform random variables on $[0,1]$ and replace $\zeta_i$ by $\Phi^{-1}(1 - U_i/2)$. Note that

\[ \Phi^{-1}(1 - U_i/2) \le \sqrt{2\log(2/U_i)}, \qquad \lambda^{\mathrm{BH}}_{k+i} \asymp \sqrt{2\log\frac{2p}{k+i}}. \]

Hence, it suffices to prove that for some constant $\kappa_q$,

\[ \log(2/U_{(i)})\, \log(p/k) \le \kappa_q \cdot \log p \cdot \log(2p/(k+i)) \tag{A.26} \]

holds for all $i = 1, \ldots, p - k$ with probability approaching one. Applying the representation given in the proof of Lemma A.9 and noting that $T_{p-k+1} = (1 + o_P(1))p$, we see that (A.26) is implied by

\[ \log(3p/T_i)\, \log(p/k) \le \kappa_q \cdot \log p \cdot \log(2p/(k+i)). \tag{A.27} \]

We consider $i \le 4\sqrt{p}$ and $i > 4\sqrt{p}$ separately.

Suppose first that $i \le 4\sqrt{p}$. In this case,

\[ \log(2p/(k+i)) = (1 + o(1))\log(p/k). \]

Thus (A.27) would follow from

\[ \log(3p/T_i) = O(\log p) \]

for all such $i$. This is, however, self-evident since $T_i \ge E_1 \ge 1/p$ with probability $e^{-1/p} = 1 - o(1)$.

Suppose now that $i > 4\sqrt{p}$. In this case, we make use of the fact that $T_i > i/2 - \sqrt{p}$ for all $i$ with probability tending to one as $p \to \infty$. Then we prove a stronger result, namely,

\[ \log\frac{3p}{i/2 - \sqrt{p}}\, \log(p/k) \le \kappa_q \cdot \log p \cdot \log\frac{2p}{k+i} \]

for all $i > 4\sqrt{p}$. This follows from the two observations below:

\[ \log\frac{3p}{i/2 - \sqrt{p}} \asymp \log\frac{2p}{i}, \qquad \log\frac{2p}{k+i} \ge \min\Big\{ \log\frac{p}{k}, \log\frac{p}{i} \Big\}. \]

□


In the proofs of the next two lemmas, namely, Lemma A.11 and Lemma A.12, we introduce an orthogonal matrix $Q \in \mathbb{R}^{n \times n}$ that obeys

\[ Qz = (\|z\|, 0, \ldots, 0). \]

In the proofs, $Q$ is further set to be measurable with respect to $z$. Hence, $Q$ is independent of $X$. There are many options available to construct such a $Q$, including the Householder transformation. Set

\[ W = \begin{bmatrix} w \\ \widetilde W \end{bmatrix} := QX, \]

where $w \in \mathbb{R}^{1 \times p}$ and $\widetilde W \in \mathbb{R}^{(n-1) \times p}$. The independence between $Q$ and $X$ suggests that $W$ is still a Gaussian random matrix, consisting of i.i.d. $\mathcal{N}(0, 1/n)$ entries. Note that

\[ X_i'z = (QX_i)'(Qz) = \|z\|(QX_i)_1 = \|z\| w_i. \]

This implies that $S^*$ is constructed as the union of $S$ and the $k^* - k$ indices in $\{1, \ldots, p\} \setminus S$ with the largest $|w_i|$. Since $w$ and $\widetilde W$ are independent, we see that both $\widetilde W_{S^*}$ and $\widetilde W_{\overline{S^*}}$ are also Gaussian random matrices. These points are crucial in the proof of these two lemmas.
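One concrete way to realize such a $Q$ is a Householder reflection. The sketch below is an illustration under that assumption: it applies $Q = I - 2vv'/\|v\|^2$ with $v = z - \|z\|e_1$ without forming the matrix, and checks both $Qz = (\|z\|, 0, \ldots, 0)$ and the identity $X_i'z = \|z\| w_i$ for one hypothetical column `x`. The helper name and the random inputs are not from the paper.

```python
import math
import random

def householder_apply(z, x):
    """Apply to x the Householder reflection Q built from z, which maps z to (||z||, 0, ..., 0)."""
    norm_z = math.sqrt(sum(v * v for v in z))
    v = list(z)
    v[0] -= norm_z                      # v = z - ||z|| e_1
    vv = sum(c * c for c in v)
    if vv == 0.0:                       # z is already a nonnegative multiple of e_1, so Q = I
        return list(x)
    coef = 2.0 * sum(vi * xi for vi, xi in zip(v, x)) / vv
    return [xi - coef * vi for xi, vi in zip(x, v)]

rng = random.Random(0)
n = 6
z = [rng.gauss(0.0, 1.0) for _ in range(n)]
x = [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]   # one column of a Gaussian design

norm_z = math.sqrt(sum(v * v for v in z))
qz = householder_apply(z, z)            # should equal (||z||, 0, ..., 0) up to rounding
w1 = householder_apply(z, x)[0]         # first coordinate of Q x
inner = sum(a * b for a, b in zip(x, z))

print(qz)
print(inner, norm_z * w1)               # the two numbers agree up to rounding
```

Since $Q$ depends only on $z$, applying it to every column of $X$ leaves the Gaussian distribution of the design unchanged, which is the property used in the two lemmas that follow.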

Lemma A.11. Let $k \le k^* < \min\{n, p\}$ be any (deterministic) integer. Denote by $\sigma_{\min}$ and $\sigma_{\max}$, respectively, the smallest and the largest singular value of $X_{S^*}$. Then for any $t > 0$,

\[ \sigma_{\min} > \sqrt{1 - 1/n} - \sqrt{k^*/n} - t \]

holds with probability at least $1 - e^{-nt^2/2}$. Furthermore,

\[ \sigma_{\max} < \sqrt{1 - 1/n} + \sqrt{k^*/n} + \sqrt{8k^*\log(p/k^*)/n} + t \]

holds with probability at least $1 - e^{-nt^2/2} - (\sqrt{2}ek^*/p)^{k^*}$.

Proof of Lemma A.11. Recall that $\widetilde W_{S^*} \in \mathbb{R}^{(n-1) \times k^*}$ is a Gaussian design with i.i.d. $\mathcal{N}(0, 1/n)$ entries. Since $W_{S^*}$ and $X_{S^*}$ have the same set of singular values, we consider $W_{S^*}$.

Classical theory on Wishart matrices (see [56], for example) asserts that (i) all the singular values of $\widetilde W_{S^*}$ are larger than $\sqrt{1 - 1/n} - \sqrt{k^*/n} - t$ with probability at least $1 - e^{-nt^2/2}$, and (ii) are all smaller than $\sqrt{1 - 1/n} + \sqrt{k^*/n} + t$ with probability at least $1 - e^{-nt^2/2}$. Clearly, all the singular values of $W_{S^*}$ are at least as large as $\sigma_{\min}(\widetilde W_{S^*})$. Thus, (i) yields the first claim. For the other, Lemma A.7 asserts that the event $\|w_{S^*}\| \le \sqrt{8k^*\log(p/k^*)/n}$ happens with probability at least $1 - (\sqrt{2}ek^*/p)^{k^*}$. On this event,

\[ \|W_{S^*}\| \le \sqrt{\|\widetilde W_{S^*}\|^2 + 8k^*\log(p/k^*)/n}, \]

where $\|\cdot\|$ denotes the spectral norm. Hence, (ii) gives

\[ \|X_{S^*}\| \le \|\widetilde W_{S^*}\| + \sqrt{8k^*\log(p/k^*)/n} \le \sqrt{1 - 1/n} + \sqrt{k^*/n} + t + \sqrt{8k^*\log(p/k^*)/n} \]

with probability at least $1 - e^{-nt^2/2} - (\sqrt{2}ek^*/p)^{k^*}$.

□
Lemma A.12. Denote by $\hat b_{S^*}$ the solution to the reduced SLOPE problem (4.6) with $T = S^*$ and $\lambda = \lambda^{\mathrm{BH}}$. Keep the assumptions from Lemma A.9, and additionally assume $k^*/\min\{n, p\} \to 0$. Then there exists a constant $C_q$ only depending on $q$ such that

\[ X_{\overline{S^*}}' X_{S^*} (\beta_{S^*} - \hat b_{S^*}) \preceq C_q \cdot \sqrt{\frac{k^*\log p}{n}}\, \big( \lambda^{\mathrm{BH}}_{k^*+1}, \ldots, \lambda^{\mathrm{BH}}_{p} \big) \]

with probability tending to one.


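Proof of Lemma A.12. In this proof, $C$ is a constant that only depends on $q$ and whose value may change at each occurrence. Rearrange the objective term as

\[ X_{\overline{S^*}}' X_{S^*} (\beta_{S^*} - \hat b_{S^*}) = X_{\overline{S^*}}' X_{S^*} (X_{S^*}' X_{S^*})^{-1} \big( X_{S^*}'(y - X_{S^*}\hat b_{S^*}) - X_{S^*}'z \big) = X_{\overline{S^*}}' Q'Q X_{S^*} (X_{S^*}' X_{S^*})^{-1} \big( X_{S^*}'(y - X_{S^*}\hat b_{S^*}) - X_{S^*}'z \big) = X_{\overline{S^*}}' Q' \xi, \]

where

\[ \xi := Q X_{S^*} (X_{S^*}' X_{S^*})^{-1} \big( X_{S^*}'(y - X_{S^*}\hat b_{S^*}) - X_{S^*}'z \big). \]

For future usage, note that $\xi$ only depends on $w$ and $\widetilde W_{S^*}$ and is, therefore, independent of $\widetilde W_{\overline{S^*}}$.

We begin by bounding $\|\xi\|$. It follows from the KKT condition of SLOPE that $X_{S^*}'(y - X_{S^*}\hat b_{S^*})$ is majorized by $\lambda^{[k^*]}$. Hence, it follows from Fact 3.1 that

\[ \big\| X_{S^*}'(y - X_{S^*}\hat b_{S^*}) \big\| \le \big\| \lambda^{[k^*]} \big\|. \tag{A.28} \]

Lemma A.11 with $t = 1/2$ gives

\[ \big\| X_{S^*} (X_{S^*}' X_{S^*})^{-1} \big\| \le \Big( \sqrt{1 - 1/n} - \sqrt{k^*/n} - 1/2 \Big)^{-1} < 2.01 \tag{A.29} \]

with probability at least $1 - e^{-n/8}$ for sufficiently large $p$, where in the last step we have used $k^*/n \to 0$. Hence, from (A.28) and (A.29) we get

\[ \begin{aligned} \|\xi\| &\le \big\| X_{S^*} (X_{S^*}' X_{S^*})^{-1} \big\| \, \big\| X_{S^*}'(y - X_{S^*}\hat b_{S^*}) - X_{S^*}'z \big\| \\ &\le 2.01 \Big( \big\| \lambda^{[k^*]} \big\| + 4\sqrt{2k^*\log(p/k^*)} \Big) \\ &\le 2.01 \Big( (1 + \epsilon)\sqrt{C_q} + 4\sqrt{2} \Big) \sqrt{k^*\log(p/k^*)} \\ &= C \cdot \sqrt{k^*\log(p/k^*)} \end{aligned} \tag{A.30} \]

with probability at least $1 - e^{-n/2} - (\sqrt{2}ek^*/p)^{k^*} - e^{-n/8} \to 1$; we used Lemma A.7 in the second line and Lemma A.8 in the third. (A.30) will help us in finishing the proof.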

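Write

\[ X_{\overline{S^*}}' X_{S^*} (\beta_{S^*} - \hat b_{S^*}) = X_{\overline{S^*}}' Q' \xi = W_{\overline{S^*}}' \xi = \big( w_{\overline{S^*}}', 0 \big) \xi + \big( 0, \widetilde W_{\overline{S^*}}' \big) \xi. \tag{A.31} \]

It follows from Lemma A.9 that $w_{\overline{S^*}}$ is majorized by $\big( \lambda^{\mathrm{BH}}_{k^*+1}, \ldots, \lambda^{\mathrm{BH}}_{p} \big)/\sqrt{n}$ in probability. As a result, the first term in the right-hand side obeys

\[ \big( w_{\overline{S^*}}', 0 \big) \xi \preceq C \cdot \sqrt{\frac{k^*\log(p/k^*)}{n}}\, \big( \lambda^{\mathrm{BH}}_{k^*+1}, \ldots, \lambda^{\mathrm{BH}}_{p} \big) \tag{A.32} \]

with probability tending to one. For the second term, by exploiting the independence between $\xi$ and $\widetilde W_{\overline{S^*}}$, we have

\[ \big( 0, \widetilde W_{\overline{S^*}}' \big) \xi \stackrel{d}{=} \|(\xi_2, \ldots, \xi_n)\| \cdot (\zeta_1, \ldots, \zeta_{p-k^*}), \]

where $\zeta_1, \ldots, \zeta_{p-k^*}$ are i.i.d. $\mathcal{N}(0, 1/n)$. Since $k^*/p \to 0$, applying Lemma A.10 gives

\[ (\zeta_1, \ldots, \zeta_{p-k^*}) \preceq C \cdot \sqrt{\frac{\log p}{n\log(p/k^*)}}\, \big( \lambda^{\mathrm{BH}}_{k^*+1}, \ldots, \lambda^{\mathrm{BH}}_{p} \big) \]

with probability approaching one. Hence, owing to (A.30),

\[ \big( 0, \widetilde W_{\overline{S^*}}' \big) \xi \preceq C \cdot \sqrt{\frac{k^*\log p}{n}}\, \big( \lambda^{\mathrm{BH}}_{k^*+1}, \ldots, \lambda^{\mathrm{BH}}_{p} \big) \tag{A.33} \]

holds with probability approaching one. Finally, combining (A.32) and (A.33) gives that

\[ X_{\overline{S^*}}' X_{S^*} (\beta_{S^*} - \hat b_{S^*}) = \big( w_{\overline{S^*}}', 0 \big) \xi + \big( 0, \widetilde W_{\overline{S^*}}' \big) \xi \preceq C \cdot \Big( \sqrt{\frac{k^*\log(p/k^*)}{n}} + \sqrt{\frac{k^*\log p}{n}} \Big) \big( \lambda^{\mathrm{BH}}_{k^*+1}, \ldots, \lambda^{\mathrm{BH}}_{p} \big) \preceq C \cdot \sqrt{\frac{k^*\log p}{n}}\, \big( \lambda^{\mathrm{BH}}_{k^*+1}, \ldots, \lambda^{\mathrm{BH}}_{p} \big) \]

holds with probability tending to one.

□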
A.4. Proofs for Section 5.

Lemma A.13. Keep the assumptions from Lemma 5.1 and let $\zeta_1, \ldots, \zeta_p$ be i.i.d. $\mathcal{N}(0,1)$. Then

\[ \#\{ 2 \le i \le p : \zeta_i > \tau + \zeta_1 \} \to \infty \]

in probability.


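Proof of Lemma A.13. With probability tending to one, $\tau' := \tau + \zeta_1$ also obeys $\tau'/\sqrt{2\log p} \to 1$ and $\sqrt{2\log p} - \tau' \to \infty$. This shows that we only need to prove a simpler version of this lemma, namely, $\#\{ 1 \le i \le p : \zeta_i > \tau \} \to \infty$ in probability.

Put $\Delta = \sqrt{2\log p} - \tau = o(\sqrt{2\log p})$ and $a = \mathbb{P}(\zeta_1 > \tau)$. Then, $\#\{ 1 \le i \le p : \zeta_i > \tau \}$ is a binomial random variable with $p$ trials and success probability $a$. Hence, it suffices to demonstrate that $ap \to \infty$. To this end, note that

\[ a = 1 - \Phi(\tau) \sim \frac{1}{\tau} \frac{1}{\sqrt{2\pi}} e^{-\tau^2/2} \asymp \frac{1}{\sqrt{2\log p}}\, e^{-\log p - \Delta^2/2 + \Delta\sqrt{2\log p}}, \]

which gives

\[ ap \asymp \frac{1}{\sqrt{2\log p}}\, e^{-\Delta^2/2 + \Delta\sqrt{2\log p}} = \frac{e^{(1 + o(1))\Delta\sqrt{2\log p}}}{\sqrt{2\log p}}. \]

Since $\Delta \to \infty$ (in fact, it is sufficient to have $\Delta$ bounded away from 0 from below), we have

\[ e^{(1 + o(1))\Delta\sqrt{2\log p}} / \sqrt{2\log p} \to \infty \]

as we wish. □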


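Proof of Lemma 5.1. For sufficiently large $p$, $2(1 - \epsilon)\log p \le (1 - \epsilon/2)\tau^2$. Hence, it is sufficient to show

\[ \mathbb{P}_\pi\big( \|\hat\beta - \beta\|^2 \le (1 - \epsilon/2)\tau^2 \big) \to 0 \]

uniformly for all estimators $\hat\beta$. Letting $I$ be the random coordinate,

\[ \|\hat\beta - \beta\|^2 = \|\hat\beta\|^2 - 2\tau\hat\beta_I + \tau^2, \]

which is smaller than or equal to $(1 - \epsilon/2)\tau^2$ if and only if

\[ \hat\beta_I \ge \frac{2\|\hat\beta\|^2 + \epsilon\tau^2}{4\tau}. \]

Denote by $A = A(y; \hat\beta)$ the set of all $i \in \{1, \ldots, p\}$ such that $\hat\beta_i \ge (2\|\hat\beta\|^2 + \epsilon\tau^2)/(4\tau)$, and let $b$ be the minimum value of these $\hat\beta_i$. Then

\[ b \ge \frac{2\|\hat\beta\|^2 + \epsilon\tau^2}{4\tau} \ge \frac{2|A|b^2 + \epsilon\tau^2}{4\tau} \ge \frac{2\sqrt{2|A|b^2\epsilon\tau^2}}{4\tau}, \]

which gives

\[ |A| \le 2/\epsilon. \tag{A.34} \]

Recall that $\|\hat\beta - \beta\|^2 \le (1 - \epsilon/2)\tau^2$ if and only if $I$ is among these $|A|$ components. Hence,

\[ \mathbb{P}_\pi\big( \|\hat\beta - \beta\|^2 \le (1 - \epsilon/2)\tau^2 \,\big|\, y \big) = \mathbb{P}_\pi(I \in A \mid y) = \sum_{i \in A} \mathbb{P}_\pi(I = i \mid y) = \frac{\sum_{i \in A} e^{\tau y_i}}{\sum_{i=1}^{p} e^{\tau y_i}}, \tag{A.35} \]

where we use the fact that $A$ is almost surely determined by $y$. Since (A.35) is maximal if $A$ is the set of indices with the largest $y_i$'s, (A.35) and (A.34) together yield

\[ \mathbb{P}_\pi\big( \|\hat\beta - \beta\|^2 \le (1 - \epsilon/2)\tau^2 \big) \le \mathbb{P}_\pi\big( y_I = \tau + z_I \text{ is at least the } \lceil 2/\epsilon \rceil\text{th largest among } y_1, \ldots, y_p \big) \to 0, \]

where the last step is provided by Lemma A.13.

□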
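The posterior weights in (A.35) are a softmax of $\tau y_i$, so they are easy to compute for simulated data. The sketch below is only an illustration: the choice of `tau`, the sample size, the seed and the helper name are hypothetical, and the model assumed is $y = \tau e_I + z$ with standard Gaussian noise and $I$ uniform.

```python
import math
import random

def posterior_over_index(y, tau):
    """Posterior P(I = i | y), proportional to exp(tau * y_i); shifted by max(y) for stability."""
    shift = max(y)
    weights = [math.exp(tau * (yi - shift)) for yi in y]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)
p = 1000
tau = math.sqrt(2 * math.log(p)) - 1.0   # hypothetical signal size, below sqrt(2 log p)
true_index = 0

y = [rng.gauss(0.0, 1.0) for _ in range(p)]
y[true_index] += tau

post = posterior_over_index(y, tau)
rank_of_truth = 1 + sum(yi > y[true_index] for yi in y)   # rank of y_I among all observations

print(rank_of_truth, post[true_index])
```

Since the weights are increasing in $y_i$, the set $A$ of at most $2/\epsilon$ indices that maximizes (A.35) consists of the largest observations, which is why the proof reduces to the rank of $y_I$.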


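Proof of Lemma 5.3. To closely follow the proof of Lemma 5.1, denote by $A = A(y, X; \hat\beta)$ the set of all $i \in \{1, \ldots, p\}$ such that $\hat\beta_i \ge (2\|\hat\beta\|^2 + \epsilon\alpha^2\tau^2)/(4\alpha\tau)$, and keep the same notation $b$ as before. Then $\|\hat\beta - \beta\|^2 \le (1 - \epsilon/2)\alpha^2\tau^2$ if and only if $I \in A$. Hence,

\[ \mathbb{P}_\pi\big( \|\hat\beta - \beta\|^2 \le (1 - \epsilon/2)\alpha^2\tau^2 \,\big|\, y, X \big) = \mathbb{P}_\pi(I \in A \mid y, X) = \sum_{i \in A} \mathbb{P}_\pi(I = i \mid y, X) = \frac{\sum_{i \in A} \exp\big( \alpha\tau X_i'y - \alpha^2\tau^2\|X_i\|^2/2 \big)}{\sum_{i=1}^{p} \exp\big( \alpha\tau X_i'y - \alpha^2\tau^2\|X_i\|^2/2 \big)}, \tag{A.36} \]

and this quantity is maximal if $A$ is the set of indices $i$ with the largest values of $X_i'y/\alpha - \tau\|X_i\|^2/2$. As shown in Lemma 5.1, $|A| \le 2/\epsilon$, which gives

\[ \mathbb{P}_\pi\big( \|\hat\beta - \beta\|^2 \le (1 - \epsilon/2)\alpha^2\tau^2 \big) \le \mathbb{P}_\pi\big( X_I'y/\alpha - \tau\|X_I\|^2/2 \text{ is at least the } \lceil 2/\epsilon \rceil\text{th largest} \big). \tag{A.37} \]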
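We complete the proof by showing that the probability in the right-hand side of (A.37) is negligible uniformly over all estimators $\hat\beta$ as $p \to \infty$. By the independence between $I$ and $X, z$, we can assume $I = 1$ while evaluating this probability. With this in mind, we aim to show that there are sufficiently many $i$'s such that

\[ X_i'y/\alpha - \frac{\tau}{2}\|X_i\|^2 - X_1'y/\alpha + \frac{\tau}{2}\|X_1\|^2 = X_i'(z/\alpha + \tau X_1) - \frac{\tau}{2}\|X_i\|^2 - X_1'z/\alpha - \frac{\tau}{2}\|X_1\|^2 \]

is positive. Since

\[ X_1'z/\alpha + \frac{\tau}{2}\|X_1\|^2 = O_P(1/\alpha) + \frac{\tau}{2}\big( 1 + O_P(1/\sqrt{n}) \big), \]

it suffices to show that

\[ \#\Big\{ 2 \le i \le p : X_i'(z/\alpha + \tau X_1) - \frac{\tau}{2}\|X_i\|^2 > \frac{C_1}{\alpha} + \frac{\tau}{2} + \frac{C_2\tau}{\sqrt{n}} \Big\} \le \lceil 2/\epsilon \rceil - 1 \tag{A.38} \]

holds with vanishing probability for all positive constants $C_1, C_2$. By the independence between $X_i$ and $z/\alpha + \tau X_1$, we can replace $z/\alpha + \tau X_1$ by $(\|z/\alpha + \tau X_1\|, 0, \ldots, 0)$ in (A.38). That is,

\[ X_i'(z/\alpha + \tau X_1) - \frac{\tau}{2}\|X_i\|^2 \stackrel{d}{=} \|z/\alpha + \tau X_1\|\, X_{i,1} - \frac{\tau}{2} X_{i,1}^2 - \frac{\tau}{2}\|X_{i,-1}\|^2, \]

where $X_{i,-1} \in \mathbb{R}^{n-1}$ is $X_i$ without the first entry. To this end, we point out that the following three events all happen with probability tending to one:

\[ \#\{ 2 \le i \le p : \|X_{i,-1}\| \le 1 \}/p \to 1/2, \qquad \max_i X_{i,1}^2 \le \frac{2\log p}{n}, \qquad \|z/\alpha + \tau X_1\| \ge \big( \sqrt{n} - \sqrt{\log p} \big)/\alpha. \tag{A.39} \]

Making use of this and (A.38), we only need to show that

\[ \begin{aligned} N &:= \#\Big\{ 2 \le i \le 0.49p : \frac{1}{\alpha}\Big( 1 - \sqrt{(\log p)/n} \Big)\sqrt{n}\, X_{i,1} > \frac{\tau\log p}{n} + \frac{\tau}{2} + \frac{C_1}{\alpha} + \frac{\tau}{2} + \frac{C_2\tau}{\sqrt{n}} \Big\} \\ &= \#\Big\{ 2 \le i \le 0.49p : \frac{1}{\alpha}\Big( 1 - \sqrt{(\log p)/n} \Big)\sqrt{n}\, X_{i,1} > \tau + \frac{\tau\log p}{n} + \frac{C_1}{\alpha} + \frac{C_2\tau}{\sqrt{n}} \Big\} \end{aligned} \]

obeys

\[ N \le \lceil 2/\epsilon \rceil - 1 \tag{A.40} \]

with vanishing probability. The first line of (A.39) shows that there are at least $0.49p$ many $i$'s such that $\|X_{i,-1}\| \le 1$ and we assume they correspond to indices $2 \le i \le 0.49p$ without loss of generality. (Note that $N$ is independent of all $X_{i,-1}$'s.) Observe that

\[ \tau' := \frac{\tau + \tau(\log p)/n + C_1/\alpha + C_2\tau/\sqrt{n}}{\big( 1 - \sqrt{(\log p)/n} \big)/\alpha} \le \alpha\Big( 1 + 2\sqrt{\frac{\log p}{n}} \Big)\tau + O(1) \]

for sufficiently large $p$ (to ensure $(\log p)/n$ is small). Hence, plugging the specific choice of $\tau$ and using $\alpha \le 1$, we obtain

\[ \tau' \le \Big( 1 + 2\sqrt{(\log p)/n} \Big)\tau + O(1) \le \sqrt{2\log p} - \log\sqrt{2\log p} + O(1), \]

which reveals that $\sqrt{2\log(0.49p)} - \tau' = \sqrt{2\log p} - \tau' + o(1) \to \infty$. Since $\sqrt{n}X_{i,1}$ are i.i.d. $\mathcal{N}(0,1)$, Lemma A.13 validates (A.40). □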


Proof of Corollary 1.5. Let $c > 0$ be a sufficiently small constant to be determined later. It is sufficient to prove the claim with $p$ replaced by a possibly smaller value given by $p^* := \min\{\lfloor cn \rfloor, p\}$ (if we knew that $\beta_i = 0$ for $p^* + 1 \le i \le p$, the loss of any estimator $X\hat\beta$ would not increase after projecting onto the linear space spanned by the first $p^*$ columns). Hereafter, we assume $X \in \mathbb{R}^{n \times p^*}$ and $\beta \in \mathbb{R}^{p^*}$. Observe that $p = O(n)$ implies $p = O(p^*)$ and, therefore,

\[ \log(p^*/k) \sim \log(p/k). \tag{A.41} \]

In particular, $k/p^* \to 0$ and $n/\log(p^*/k) \to \infty$. This suggests that we can apply Theorem 5.4 to our problem, obtaining

\[ \inf_{\hat\beta} \sup_{\|\beta\|_0 \le k} \mathbb{P}\Big( \frac{\|\hat\beta - \beta\|^2}{2k\log(p^*/k)} > 1 - \epsilon' \Big) \to 1 \]

for every constant $\epsilon' > 0$. Because of (A.41), we also have

\[ \inf_{\hat\beta} \sup_{\|\beta\|_0 \le k} \mathbb{P}\Big( \frac{\|\hat\beta - \beta\|^2}{2k\log(p/k)} > 1 - \epsilon' \Big) \to 1 \tag{A.42} \]

for any $\epsilon' > 0$.

Since $p^*/n \le c < 1$, the smallest singular value of the Gaussian random matrix $X$ is at least $1 - \sqrt{c} + o_P(1)$ (see, for example, [56]). This result, together with (A.42), yields

\[ \inf_{\hat\beta} \sup_{\|\beta\|_0 \le k} \mathbb{P}\Big( \frac{\|X\hat\beta - X\beta\|^2}{2k\log(p/k)} > (1 - \sqrt{c})^2 (1 - \epsilon') \Big) \to 1 \]

for each $\epsilon' > 0$. Finally, choose $c$ and $\epsilon'$ sufficiently small such that $(1 - \sqrt{c})^2 (1 - \epsilon') > 1 - \epsilon$. □